ABSTRACT



Since 1951, when Robbins and Monro's pioneering paper on
stochastic approximation was published, many articles have
appeared dealing with extensions, modifications, methods
and applications of stochastic approximation. While the
concepts involved are relatively simple, they are mathematically
difficult, and the information concerning specific results has
been widely scattered and difficult for the interested
researcher to collect. This paper discusses the major results
and provides the references needed to direct the reader to
more specific findings.



TABLE OF CONTENTS

I. INTRODUCTION

II. MOTIVATING STOCHASTIC APPROXIMATION
    A. QUANTILE ESTIMATION
    B. LEVEL DETERMINATION
    C. ROUND-OFF ERRORS

III. THE ROBBINS-MONRO METHOD
    A. WEAKENING THE CONDITIONS FOR CONVERGENCE
    B. CONVERGENCE WITH PROBABILITY 1
    C. A FURTHER WEAKENING OF CONDITIONS FOR CONVERGENCE
    D. THE MULTIDIMENSIONAL CASE
    E. DVORETZKY'S GENERALIZED PROCESS
    F. EXPECTED SQUARED ERROR
    G. EXPECTED SQUARED ERROR IN THE LINEAR CASE
    H. RATE OF CONVERGENCE
    I. ASYMPTOTIC NORMALITY
    J. SELECTION OF STEP SIZE, $a_n$
    K. ACCELERATED CONVERGENCE
    L. CONFIDENCE INTERVALS AND STOPPING TIMES
    M. DYNAMIC STOCHASTIC APPROXIMATION
    N. CONTINUOUS STOCHASTIC APPROXIMATION
    O. EXTENSIONS OF CONTINUOUS STOCHASTIC APPROXIMATION

IV. FINDING THE MAXIMUM OF AN UNKNOWN REGRESSION FUNCTION: THE KIEFER-WOLFOWITZ METHOD
    A. WITH CONSTANT COEFFICIENTS
    B. CONVERGENCE WITH PROBABILITY ONE
    C. MULTIDIMENSIONAL K-W
    D. ASYMPTOTIC PROPERTIES OF THE K-W PROCESS
    E. MAXIMUM SAMPLE EXCURSIONS IN THE K-W PROCESS
    F. ACCELERATED CONVERGENCE FOR THE K-W PROCESS
    G. THE CONTINUOUS K-W PROCESS

V. PRACTICAL ASPECTS
    A. CHOICE OF $a_n$
    B. ESTIMATING THE SLOPE TO IMPROVE ASYMPTOTIC VARIANCE
    C. SMALL SAMPLE THEORY
        1. Choice of $x_1$
        2. Choice of Multiplier
        3. How to Allocate Samples
    D. ESTIMATION OF EXTREME QUANTILES
    E. THE CASE WHERE M(x) STOPS BEING A CONSTANT
    F. BOTH VARIABLES SUBJECT TO ERROR
    G. THE CASE OF a UNKNOWN

VI. APPLICATIONS
    A. APPLICATION TO A PROBLEM IN BIOLOGICAL RESEARCH
    B. AN APPLICATION TO TAILORED TESTING
    C. UPGRADING OF INERTIAL NAVIGATION SYSTEMS
    D. APPROXIMATION OF DISTRIBUTION AND DENSITY FUNCTIONS
    E. AN APPROACH TO PATTERN RECOGNITION USING STOCHASTIC APPROXIMATION TO MINIMIZE RISK

VII. AREAS FOR FURTHER STUDY
    A. DEVIATIONS FROM THE LINEAR CASE
    B. STOPPING RULES
    C. POSSIBLE WEAKENING OF CONDITIONS ON
    D. REPEAT SIMULATION OF THE KIEFER-WOLFOWITZ PROCESS
    E. MULTIDIMENSIONAL EXTENSION OF DUPAC AND KRAL'S RESULTS

LIST OF REFERENCES

INITIAL DISTRIBUTION LIST

FORM DD 1473




I. INTRODUCTION 



In many areas of analysis, such as bioassay, sensitivity
testing or learning, we are concerned with a level of output,
Y, given a certain level of some input, x. For each given
level of x, the resultant output is not deterministic but
has some underlying probability distribution, P(Y|x).

It is then common to refer to the response function
of x, denoted M(x), as the expected value of Y given x, i.e.,

$$M(x) = \int_{-\infty}^{\infty} Y \, dP(Y \mid x) = E[Y \mid x].$$

In the usual analysis of the response function, M(x), it
is assumed that the function is of known form with unknown
parameters, say

$$M(x) = B_0 + B_1 x + B_2 x^2 + \cdots$$

where the parameters, $B_i$, are estimated on the basis of
observations $Y_1, \ldots, Y_n$ corresponding to observed
values $x_1, \ldots, x_n$. The method of least squares, for
example, yields the estimators of $B_i$ which minimize the
sum of the squared errors.

However, cases often arise in which one has little
prior knowledge of the actual form of M(·), or one is only
interested in estimating the value $\theta$ such that

$$M(\theta) = \alpha$$

where $\alpha$ is a specific desired response level. We desire
to find a sampling scheme such that $x_n \to \theta$.

Robbins and Monro [Ref. 88] presented the following:

THEOREM: Let M(x) be a given function and $\alpha$ a given constant
such that the equation $M(x) = \alpha$ has a uniquely defined root*
$x = \theta$. Let Y(x) denote a realization of an experiment at
"control level" x. Assume Y(x) has distribution

$$P(Y(x) \le y) = H(y \mid x) \quad \text{such that} \quad M(x) = \int_{-\infty}^{\infty} y \, dH(y \mid x)$$

(i.e., $M(x) = E[Y \mid x]$). Choose $x_1$ arbitrary and define the
recursive relation

$$x_{n+1} = x_n + a_n(\alpha - Y(x_n)).$$

If there exists a positive constant C such that

$$P[\,|Y(x)| \le C\,] = 1$$

and if, for some $\delta > 0$,

$$M(x) \le \alpha - \delta \ \text{ for } x < \theta, \qquad M(x) \ge \alpha + \delta \ \text{ for } x > \theta,$$

then for $a_n = 1/n$

$$\lim_{n \to \infty} E[(x_n - \theta)^2] = 0.$$

* Note that this requires that for some $\delta > 0$,
$M(x) \le \alpha - \delta$ for $x < \theta$ and $M(x) \ge \alpha + \delta$ for $x > \theta$,
but does not specifically require that $M(\theta) = \alpha$.


The procedure of recursively defining $x_{n+1}$ as a function
of $x_n$ by

$$x_{n+1} = x_n + a_n(\alpha - Y(x_n))$$

is referred to as the Robbins-Monro method or procedure.
(Note that the process is a first order Markov process,
although it is in general non-homogeneous.) Papers which
followed Robbins and Monro's discussed topics such as
convergence, finding the maximum (or minimum) of a function,
multidimensional applications, and accelerated processes,
to name a few.
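
As an illustration, the recursion above can be sketched in a few lines of Python. The response model used here (a linear M(x) with uniformly distributed noise), the function names and all numerical values are assumptions made for the example only; they are not taken from Robbins and Monro's paper, and the linear response is bounded only on bounded sets of x.

```python
import random


def robbins_monro(observe, alpha, x1, n_steps, c=1.0):
    """Iterate x_{n+1} = x_n + a_n * (alpha - Y(x_n)) with a_n = c / n."""
    x = x1
    for n in range(1, n_steps + 1):
        a_n = c / n
        x = x + a_n * (alpha - observe(x))
    return x


def make_noisy_response(rng, slope=2.0, root=3.0, alpha=1.0, noise=0.5):
    """Hypothetical experiment: M(x) = alpha + slope * (x - root), plus noise."""
    def observe(x):
        return alpha + slope * (x - root) + rng.uniform(-noise, noise)
    return observe


rng = random.Random(0)
observe = make_noisy_response(rng)
# Seek the level x at which the mean response equals alpha = 1 (the root is 3).
estimate = robbins_monro(observe, alpha=1.0, x1=0.0, n_steps=20000)
print(estimate)
```

Only the noisy observations Y(x_n) enter the iteration; the form of M(x) is never used by `robbins_monro` itself, which is the point of the method.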

In the first few years of stochastic approximation,
survey papers by Derman (1956) [Ref. 25], Schmetterer (1960)
[Ref. 101], and Loginov (1966) [Ref. 81] presented the major
results through their respective dates of publication. A
text on the subject was attempted by Wasan (1969) [Ref. 129]
but received strong criticism because of serious oversights
and many misprints.¹ While the aforementioned publications
contained only the mathematical formulation, other treatments
by Fu [Refs. 53, 54] and Wetherill [Ref. 130] contained
predominantly practical and intuitive information and little
mathematical background. This paper will attempt to present
the major results of both mathematical formulation and
practical applications and to discuss the intuitive meaning
where it is applicable. The list of references is intended
to be as complete as possible on the subject. Consequently
many of the bibliographical entries are not specifically
referenced.

¹ Dupac, V., Book Review, Annals of Mathematical
Statistics, v. 41, p. 1131, 1970.

II. MOTIVATING STOCHASTIC APPROXIMATION



In certain applications, as in bioassay, sensitivity
testing, or fatigue trials, the statistician is often
interested in estimating a given quantile of a distribution
function or a level of response. Situations of this type
are candidates for solution by stochastic approximation
methods. Examples of these situations are:

A. QUANTILE ESTIMATION 

Suppose we are testing the resistance of a metallic
component to fatigue fracture. Let F(x) denote the probability
that a specimen will fail if subjected to x cycles
in a trial. Then a specimen, when tested in such a way,
represents an observation which takes on the value one or
zero depending on whether or not it fractures in x cycles.
Thus in the notation of the previous section, Y(x) = 1 if
the specimen fractures and Y(x) = 0 otherwise, so that

$$M(x) = \int_{-\infty}^{\infty} Y \, dP(Y \mid x) = F(x).$$

It is of interest to estimate the number of cycles, x, such
that for a given $\alpha$, $F(x) = M(x) = \alpha$.

As a second example, suppose we wish to administer sample
doses of a drug to laboratory animals, say rats, in order to
determine the dosage at which 50% of them die on the average.
In this case $\alpha = .5$ in our problem formulation and we
desire to solve M(x) = .5 for x.
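
The dosage example can be simulated as follows. This is a minimal sketch under stated assumptions: the logistic dose-response curve, its parameters (`ld50`, `scale`), the multiplier `c` and the function names are all hypothetical choices made for illustration. Each trial yields only a binary outcome (1 if the animal dies, 0 otherwise), and the Robbins-Monro recursion with $a_n = c/n$ adjusts the next dose.

```python
import math
import random


def death_probability(dose, ld50=4.0, scale=1.0):
    """Hypothetical dose-response curve F(x); the experimenter does not know it."""
    return 1.0 / (1.0 + math.exp(-(dose - ld50) / scale))


def estimate_quantile(alpha, x1, n_trials, c, rng):
    """Robbins-Monro search for the dose x with F(x) = alpha from 0/1 responses."""
    x = x1
    for n in range(1, n_trials + 1):
        # Binary response: Y(x) = 1 with probability F(x), else 0, so M(x) = F(x).
        y = 1.0 if rng.random() < death_probability(x) else 0.0
        x += (c / n) * (alpha - y)
    return x


rng = random.Random(2)
# The slope of the assumed curve at its median is 0.25, so c = 8 exceeds 1 / (2 * slope).
ld50_estimate = estimate_quantile(alpha=0.5, x1=0.0, n_trials=20000, c=8.0, rng=rng)
print(ld50_estimate)
```

The multiplier `c` is chosen large relative to the reciprocal of the slope at the root; the role of this choice is discussed in the sections on expected squared error.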
B. LEVEL DETERMINATION



Suppose in production it is desired to find the level
of some material such that a characteristic, say the
viscosity of the finished product, is at a pre-determined
level. However each batch is subject to impurities and
reacts as a stochastic realization. A stochastic approximation
scheme may be devised to automatically set and
correct the input flow to produce the desired results.

C. ROUND-OFF ERRORS 

As stated by Schmetterer [Ref. 101], we can consider
an application of the RM process to the problem of round-off
errors. This problem occurs, for example, if one solves
equations by a classical iteration process using electronic
computers. Define for every real number x a random variable,
Y(x), in the following way:

$$P[Y(x) = [x]] = 1 - (x - [x]),$$

$$P[Y(x) = [x] + 1] = x - [x].^{*}$$

Note that E[Y(x)] = x. From here we can deduce, as a
pattern for more general theorems, the following result. If
one solves a linear equation by an iterative procedure and
modifies it by using for every step of the iteration the
round-off rule given above, then the modified procedure
converges with probability one to a solution of the given
equation.

* Note that [x] denotes the largest integer contained in
x. For example [2.87] = 2.
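
The randomized round-off rule can be checked numerically. The sketch below is an illustration only (the function name and the sample size are assumptions): it rounds x up with probability equal to its fractional part and down otherwise, so that the average of many rounded values approaches x.

```python
import math
import random


def stochastic_round(x, rng):
    """Return [x] + 1 with probability x - [x], and [x] otherwise."""
    floor_x = math.floor(x)
    frac = x - floor_x
    return floor_x + 1 if rng.random() < frac else floor_x


rng = random.Random(1)
samples = [stochastic_round(2.87, rng) for _ in range(100000)]
mean = sum(samples) / len(samples)
print(mean)  # close to 2.87, illustrating E[Y(x)] = x
```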
III. THE ROBBINS-MONRO METHOD



In the Introduction the first of two theorems from the
original Robbins-Monro paper was presented. This first
theorem required that the response, Y(x), be bounded,
allowed discontinuities in the function M(x) = E[Y(x)],
and did not specifically require that $M(\theta) = \alpha$, the
desired response level. The second theorem of Robbins and Monro
is presented below:

THEOREM [Ref. 88]

Let the sequence $a_n$ be of the form 1/n and assume there
exists some constant C > 0 such that

$$P\{\,|Y(x)| \le C\,\} = 1,$$

and that the conditions

(i) M(x) is a nondecreasing function,
(ii) $M(\theta) = \alpha$,
(iii) $M'(\theta) > 0$

are satisfied. Then defining the recursive relation

$$x_{n+1} = x_n + a_n[\alpha - Y(x_n)]$$

implies the result

$$\lim_{n \to \infty} E[(x_n - \theta)^2] = 0.$$


A. WEAKENING THE CONDITIONS FOR CONVERGENCE

Wolfowitz [Ref. 132], in response to questions of
Robbins and Monro, showed that if the response function,
M(x), satisfies

$$|M(x)| \le C,$$

$$\int_{-\infty}^{\infty} (Y - M(x))^2 \, dP(Y \mid x) \le \sigma^2 < +\infty,$$

along with

$$M(x) < \alpha \ \text{ for } x < \theta,$$

$$M(\theta) = \alpha,$$

$$M(x) > \alpha \ \text{ for } x > \theta,$$

M(x) strictly increasing when $|x - \theta| < \delta$ for some $\delta > 0$,
and

$$\inf_{|x - \theta| \ge \delta} |M(x) - \alpha| > 0,$$

then for $x_n$ defined as in the RM process, $x_n$ converges in
probability to $\theta$.




B. CONVERGENCE WITH PROBABILITY 1 

Kallianpur [Ref. 68] and Blum [Ref. 6] both proved a mode of
convergence stronger than convergence in probability (which
is implied by convergence in mean square), namely convergence
with probability 1. Blum [Ref. 6] proved that the RM
process converges with probability 1 under conditions even
weaker than those of Wolfowitz [Ref. 132]. While Wolfowitz
required that the regression function M(·) be bounded,
Blum only required that it lie between two lines.

BLUM'S THEOREM [Ref. 6]

Let M(x) be the regression function. We assume that
M(x) is measurable and satisfies the following conditions:

$$|M(x)| \le C + d|x| \quad \text{for some } C, d > 0,$$

$$\int_{-\infty}^{\infty} (Y - M(x))^2 \, dF(Y \mid x) \le \sigma^2 < +\infty,$$

$$M(x) < \alpha \ \text{ for } x < \theta,$$

and

$$M(x) > \alpha \ \text{ for } x > \theta,$$

$$\inf_{\delta_1 \le |x - \theta| \le \delta_2} |M(x) - \alpha| > 0$$

for any pair $(\delta_1, \delta_2)$; if moreover

$$\sum_{n=1}^{\infty} a_n = \infty$$

and

$$\sum_{n=1}^{\infty} a_n^2 < +\infty,$$

then $x_n$ converges to $\theta$ with probability 1, i.e.,

$$P\{\lim_{n \to \infty} x_n = \theta\} = 1.$$


C. A FURTHER WEAKENING OF CONDITIONS FOR CONVERGENCE

In 1963 Friedman [Ref. 51] further weakened the requirements
for convergence with probability 1 by removing the
necessity for M(x) and $\sigma^2(x)$ to be bounded by a linear
function and a constant respectively. Friedman's theorem
is presented here:

THEOREM [Ref. 51]

Let f(x) be a function which is positive and bounded
in any finite interval. Let the following conditions be
satisfied:

$$|M(x)| \le (L|x| + K) f(x) \quad \text{for constants } L, K > 0,$$

$$\sigma^2(x) \le \sigma^2 f^2(x),$$

$$M(x) < \alpha \ \text{ for } x < \theta,$$

and

$$M(x) > \alpha \ \text{ for } x > \theta,$$

$$\inf_{\delta_1 \le |x - \theta| \le \delta_2} |M(x) - \alpha| > 0$$

for any pair $(\delta_1, \delta_2)$; then the sequence defined by

$$x_{n+1} = x_n + a_n[\alpha - Y(x_n)]/f(x_n)$$

converges to $\theta$ with probability 1.

This theorem of Friedman enables one to construct a
convergent process when $|M(x)|$ and $\sigma^2(x)$ are bounded by
known functions $f_1(x)$ and $f_2(x)$. One then takes
$f(x) = \max[f_1(x), (f_2(x))^{1/2}]$. This procedure is also applicable
where f(x) is decreasing to zero for large values of x.
However the convergence is relatively slow.
Later Gladyshev [Ref. 57] simplified the conditions
for convergence with probability 1 with the following:

THEOREM

Let M(x) be a measurable function such that

$$\inf_{\epsilon < |x - \theta| < \epsilon^{-1}} (x - \theta)\{M(x) - \alpha\} > 0 \quad \text{for all } \epsilon > 0.$$

Moreover assume there exists a positive number d such that
for all x we have

$$E[Y^2(x)] \le d(1 + x^2).$$

If $\{a_n\}$ satisfies the conditions previously stated, then
$x_n$ converges to $\theta$ with probability 1, where $x_n$ is defined
as in the original RM process.

D. THE MULTIDIMENSIONAL CASE 

Blum [Ref. 7] was the first to generalize the Robbins-Monro
process to the multidimensional case. He considered
the following problem:

Assume that we are given a family of N random variables

$$Y_1(x_1, \ldots, x_N), \ \ldots, \ Y_N(x_1, \ldots, x_N)$$

with distribution functions $P_i$, $i = 1, \ldots, N$;
moreover assume that

$$M_i(x_1, \ldots, x_N) = \int Y_i \, dP_i \quad \text{for } i = 1, \ldots, N$$

(i.e., $M_i$ is the corresponding regression function in the
ith dimension).

It is desired to construct a sequence whose limit is
the root vector of the system of equations

$$M_i(x_1, \ldots, x_N) = \alpha_i, \quad i = 1, \ldots, N.$$

For simplicity it is assumed that all $\alpha_i = 0$ and that
$M_i(0) = 0$.

Let f(x) be a real function that is defined on real
N-dimensional space and has continuous first and second
derivatives. Let $A(x) = (\partial^2 f / \partial x_i \partial x_j)$ denote the matrix of
second derivatives and let $D(x) = (\partial f / \partial x_i)$ denote the vector
of the first derivatives.

In matrix form the Robbins-Monro process is of the form

$$X_{n+1} = X_n + a_n Y_n$$

where it is assumed as before that

$$\sum_{n=1}^{\infty} a_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} a_n^2 < +\infty.$$

Observe the following notation:

Let $U(x) = \langle D(x), M(x) \rangle$ (i.e., the scalar product of D(x)
and M(x)).

Let $V_a(x) = E\{\langle Y_x, A(x + q a Y_x) Y_x \rangle\}$ where $0 < q < 1$.

THEOREM [Ref. 7]

If there exists a real function f(x) with continuous
derivatives that satisfies the conditions

(i) $f(x) \ge 0$,

(ii) $\sup_{\|x\| \ge \epsilon} U(x) < 0$ for $\epsilon > 0$,

(iii) $\inf_{\|x\| \ge \epsilon} |f(x) - f(0)| > 0$ for $\epsilon > 0$,

(iv) $V_a(x) \le V < +\infty$ for all a,

then the sequence $X_n$ previously defined converges
almost surely to zero.

It should also be noted that the multidimensional case
is a direct extension of theorems by Derman and Sacks [Ref. 26]
and by Gladyshev [Ref. 57], where we think of $x_n$, $Y_n$, $\theta$, $\alpha$,
and M(x) as N-dimensional vectors and treat multiplication
as the vector scalar product. Of interest is a special case
in which the regression function, M(x), is linear with a
symmetric coefficient matrix M. The following modified RM
process was proposed by Dupac [Ref. 33] for this special
case.

Here Y is a random vector whose distribution function
is $F(y \mid Y_n - \alpha)$, where $Y_n$ is a random vector with distribution
function $P(Y_n \mid x_n)$. Then the following theorem is presented
by Dupac.

THEOREM [Ref. 33]

Assume that

$$\int_{R^N} \|y - M(x)\|^2 \, dF(y \mid x) \le \sigma^2 < +\infty$$

and

$$K_0 = \min_i \lambda_i$$

are satisfied, where the $\lambda_i$ are the characteristic roots
of the matrix M. Then if $a > 1/(2K_0)$, the sequence
defined by

$$x_{n+1} = x_n - \frac{a}{n} Y_n$$

converges to $\theta$ with probability 1 and

$$E\{\|x_n - \theta\|^2\} = O(1/n).$$

There are certain situations where the multidimensional
case can be reduced to a one-dimensional case. The results
are by Eppling [Ref. 37] and require general stochastic
approximation theorems of the Dvoretzky type.
E. DVORETZKY'S GENERALIZED PROCESS



Dvoretzky [Ref. 36] has suggested that any stochastic
approximation procedure may be viewed as an ordinary
deterministic (error free) successive approximation method
with a noise component superimposed on it at each step.
On the basis of this concept a very general class of
stochastic approximation theorems can be studied.

Assume that $T_n(x_1, \ldots, x_n)$ is a Borel-measurable sequence
of transformations from n-dimensional Euclidean space, $R^n$,
into $R^1$. One may then construct the sequence $\{x_n\}$ from the
relation

$$x_{n+1} = T_n(x_1, \ldots, x_n) + Z_n$$

where $T_n(x_1, \ldots, x_n)$ is the error free transformation and
$Z_n$ is the error. Dvoretzky then proved the following theorem.

THEOREM [Ref. 36]

Let $\alpha_n$, $\beta_n$, and $\gamma_n$ be sequences of non-negative
real numbers such that

$$\lim_{n \to \infty} \alpha_n = 0, \qquad \sum_{n=1}^{\infty} \beta_n < \infty, \qquad \sum_{n=1}^{\infty} \gamma_n = \infty.$$

Moreover, assume that the condition

$$|T_n(r_1, \ldots, r_n) - \theta| \le \max[\alpha_n, \ (1 + \beta_n)|r_n - \theta| - \gamma_n]$$

is satisfied for all real $r_1, \ldots, r_n$; also that

$$\sum_{n=1}^{\infty} E(Z_n^2) < \infty$$

and

$$E(Z_n \mid x_1, \ldots, x_n) = 0$$

with probability one.* Then the sequence defined by

$$x_{n+1} = T_n(x_1, \ldots, x_n) + Z_n$$

converges to the desired quantity, $\theta$, in mean square and
with probability 1, i.e.,

$$\lim_{n \to \infty} E[(x_n - \theta)^2] = 0$$

and

$$P[\lim_{n \to \infty} x_n = \theta] = 1.$$

* Note that this condition is satisfied if the $Z_n$
are a sequence of independent errors for which $E(Z_n) = 0$
for all n.

It can easily be shown that the Robbins-Monro procedure
is a special case of Dvoretzky's generalized procedure. To
do this write the normal R-M relation

$$x_{n+1} = x_n + a_n\{\alpha - Y(x_n)\}$$

as

$$x_{n+1} = x_n + a_n[\alpha - M(x_n)] + a_n[M(x_n) - Y(x_n)].$$

Then letting

$$T_n = x_n + a_n[\alpha - M(x_n)]$$

and

$$Z_n = a_n[M(x_n) - Y(x_n)]$$

we have the RM procedure in Dvoretzky's format. In a
similar manner the Kiefer-Wolfowitz procedure, which will
be discussed later, can be written as a special case of
Dvoretzky's theorem.
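
The decomposition of the R-M step into an error-free transformation plus a noise term can be verified numerically. In the sketch below the regression function `M`, the Gaussian noise and the constant `ALPHA` are assumptions made purely for the example; the check is that the R-M update and the sum $T_n + Z_n$ agree at every step.

```python
import random

ALPHA = 1.0  # assumed desired response level


def M(x):
    """Hypothetical regression function with M(3) = ALPHA."""
    return ALPHA + 2.0 * (x - 3.0)


def rm_step(x, a_n, alpha, y):
    """Robbins-Monro update x_{n+1} = x_n + a_n * (alpha - Y(x_n))."""
    return x + a_n * (alpha - y)


def dvoretzky_parts(x, a_n, alpha, y):
    """Split the same update into T_n (error free) and Z_n (zero-mean error)."""
    t_n = x + a_n * (alpha - M(x))
    z_n = a_n * (M(x) - y)
    return t_n, z_n


rng = random.Random(3)
x = 0.0
max_gap = 0.0
for n in range(1, 1001):
    a_n = 1.0 / n
    y = M(x) + rng.gauss(0.0, 1.0)       # noisy observation Y(x_n)
    t_n, z_n = dvoretzky_parts(x, a_n, ALPHA, y)
    x_next = rm_step(x, a_n, ALPHA, y)
    max_gap = max(max_gap, abs(x_next - (t_n + z_n)))
    x = x_next
print(max_gap)  # differs from zero only by floating-point rounding
```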

Dvoretzky extended his generalized procedure even
further by replacing the sequences $\alpha_n$, $\beta_n$, and $\gamma_n$ by
non-negative functions $\alpha_n(r_1, \ldots, r_n)$, $\beta_n(r_1, \ldots, r_n)$, and
$\gamma_n(r_1, \ldots, r_n)$ respectively, provided that they satisfy the
following conditions:

(i) the functions $\alpha_n(r_1, \ldots, r_n)$ are uniformly bounded and
$\lim_{n \to \infty} \alpha_n(r_1, \ldots, r_n) = 0$ uniformly for all
sequences $r_1, \ldots, r_n, \ldots$;

(ii) the functions $\beta_n(r_1, \ldots, r_n)$ are measurable and
$\sum_{n=1}^{\infty} \beta_n(r_1, \ldots, r_n)$ is uniformly bounded and
uniformly convergent for all sequences $r_1, \ldots, r_n$;

(iii) the functions $\gamma_n(r_1, \ldots, r_n)$ satisfy
$\sum_{n=1}^{\infty} \gamma_n(r_1, \ldots, r_n) = \infty$ uniformly for
all sequences $r_1, \ldots, r_n$ for which
$\sup_{n=1,2,\ldots} |r_n| < L$, where L is a finite number.

The introduction of Dvoretzky's general conditions
allowed regression functions of the form $M(x) = -x^2$ or
$M(x) = \exp(-x^2)$ to be applicable to stochastic approximation
type theorems. The most comprehensive presentation
of Dvoretzky stochastic approximation theorems has been by
Venter [Ref. 120] in 1966. Venter's theorems generalized
the work of Dvoretzky [Ref. 36] and Wolfowitz [Ref. 133] for
transforms on the real line, of Derman and Sacks [Ref. 26]
for finite dimensional Euclidean spaces, and of Schmetterer
[Ref. 101] for Hilbert spaces.

Block [Ref. 5] had proposed a more general type of
stochastic approximation taking place on a normed vector
space.

F. EXPECTED SQUARED ERROR

While Blum [Ref. 6], Dvoretzky [Ref. 36], and Dupac
[Ref. 32] were establishing conditions under which
$b_n = E[(x_n - \theta)^2] \to 0$, others such as Chung [Ref. 14], Hodges
and Lehmann [Ref. 64], Kallianpur [Ref. 68], and Schmetterer
[Ref. 101] were trying to establish bounds on $b_n$. (Note
that $b_n$ = variance + (bias)$^2$.) Below is Schmetterer's
result for the bounded case.

THEOREM [Ref. 101]

Let M(x) be a Borel-measurable function that satisfies

(i) $P\{\,|Y(x)| \le C\,\} = 1$ for some constant C

and

(ii) $(x - \theta)\{M(x) - \alpha\} > 0$ for all $x \ne \theta$.

Also there exists an $\epsilon > 0$, and positive constants $C_1$ and $C_2$,
such that

(iii) $|M(x) - \alpha| \ge C_1 |x - \theta|$ for $|x - \theta| \le \epsilon$,

(iv) $|M(x) - \alpha| \ge C_2$ for $|x - \theta| > \epsilon$.

Then



n-1 Coa. n-1 c,a, 

‘’n i *'l " (1-A n ’ 

^ 1=1 ^ 1-1 1 = 1 ^ 1-1 



n-1 p 1 

^ eiC n (1 - 

1=1 r=l 



C^a 
3 r 



A 



r-1 



)] 



-1 



where e^ 



E (y^^-a)^ 



n 

aQ Is defined as 1, = E a^ , 



and there exists a constant. 




such that 



|M(x)-a| > — — |x -0 1 with probability 1. 

" ^%-l ^ 

As was noted above, this theorem holds for the so-called
"bounded case." This same result holds for the quasilinear
case if conditions (i), (iii), and (iv) are replaced by the
following conditions:

There exists a $\sigma^2$ such that

$$E\{[Y(x) - M(x)]^2\} \le \sigma^2$$

and the quasilinear conditions that there exist $C_5$ and $C_6$,
$C_6 < C_5$, such that

$$C_5 |x_n - \theta| \ge |M(x) - \alpha| \ge C_6 |x_n - \theta|.$$

Then the above estimate of $b_n$ holds when $a_i$ is substituted
for $a_i / A_{i-1}$, and $2C_6$ is substituted for $C_3$.

Therefore if $a_n = a/n$ where $a > 1/(2C_6)$, then $b_n = O(1/n)$.
This latter is the most frequently used result.
G. EXPECTED SQUARED ERROR IN THE LINEAR CASE



Hodges and Lehmann [Ref. 64] analyzed in detail the case
where it is desired to estimate the value of x for which
M(x) = 0, where it is assumed that $M(x) = \beta x$ and variance (Y(x))
$= \sigma^2$. Then using the RM process to define $x_n$ yields

$$b_n = E[x_n^2] = x_1^2 \Big[\prod_{r=1}^{n-1} (1 - \beta a_r)\Big]^2 + \sigma^2 \sum_{r=1}^{n-1} a_r^2 \prod_{s=r+1}^{n-1} (1 - \beta a_s)^2.$$

In analyzing this expression it becomes obvious that the
first term represents the expected bias based on the initial
choice, $x_1$, while the second term represents the variance
component of the error. Since Chung [Ref. 14]
established that under certain conditions the sequence
$a_n = c/n$ gives most rapid convergence of $x_n$ to $\theta$, it is of
interest to analyze the expression for expected squared
error for this family of coefficients.

For the first term (the expected bias due to the initial value, $x_1$)
the expected bias $= O(n^{-2c\beta})$ if $(c/n)^{-1} \ge \beta$ for all n,
but becomes quite large if $(c/n)^{-1} < \beta$. Hence, as noted
by Wetherill [Ref. 130], it would be more desirable to tend
to overestimate $c = \beta^{-1}$ rather than underestimate it.

The analysis of the second term is more complicated, but
Hodges and Lehmann have shown that it is asymptotically
equivalent to

$$\frac{\sigma^2 c^2}{n(2c\beta - 1)} \quad \text{if } c > 1/2\beta,$$

$$\frac{\sigma^2 \log n}{4 n \beta^2} \quad \text{if } c = 1/2\beta,$$

and we note that c should not be less than $(2\beta)^{-1}$ because
of the large bias which results.

These results give us conditions on the sequence
$a_n = c/n$, in terms of $\beta$, the slope of the regression function,
such that the bias resulting from an initial bad guess
rapidly tends to zero with increasing sample size and the
expected squared error of $x_n$ is of the order O(1/n). The
main shortcoming of the linear model is that we do not know
how nearly linear M(x) must be, nor how nearly constant the
variance of Y(x) must be, in order that the linear approximation
will represent what actually happens. The only evidence on
this point consists of a sampling experiment by Teichroew
[Ref. 109]. There it was found that the linear theory is in
reasonable agreement with the data.
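
The two-term expression for the expected squared error in the linear case can be evaluated exactly by the one-step recursion $b_{r+1} = (1 - \beta a_r)^2 b_r + \sigma^2 a_r^2$, which follows from $x_{r+1} = (1 - \beta a_r) x_r - a_r Z_r$ with independent zero-mean errors $Z_r$ of variance $\sigma^2$. The sketch below (function names and numerical values are illustrative assumptions) compares this exact value with the asymptotic form $\sigma^2 c^2 / (n(2c\beta - 1))$ for $a_r = c/r$ and $c > 1/2\beta$.

```python
def linear_case_mse(x1, beta, sigma2, c, n):
    """Exact b_n = E[x_n^2] for M(x) = beta * x and a_r = c / r."""
    b = x1 ** 2
    for r in range(1, n):
        a_r = c / r
        # One RM step: the bias shrinks by (1 - beta*a_r), noise adds sigma2*a_r^2.
        b = (1.0 - beta * a_r) ** 2 * b + sigma2 * a_r ** 2
    return b


def asymptotic_variance(beta, sigma2, c, n):
    """Asymptotic form of the variance term, valid for c > 1 / (2 * beta)."""
    return sigma2 * c ** 2 / (n * (2.0 * c * beta - 1.0))


n = 10001
for c in (1.0, 2.0, 4.0):
    exact = linear_case_mse(x1=5.0, beta=1.0, sigma2=1.0, c=c, n=n)
    approx = asymptotic_variance(beta=1.0, sigma2=1.0, c=c, n=n)
    print(c, exact, approx)
```

With $\beta = 1$ the choice $c = 1$ gives the smallest of the three asymptotic values, and larger multipliers inflate the variance only moderately.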

H. RATE OF CONVERGENCE 

In a recent paper Komlos and Revesz [Ref. 136] presented
estimates of the rate of convergence of the R-M process in a
form more concise than any previous result. They considered
the case $\alpha = 0$, $\theta = 0$ and presented the following
estimates.

For the case where there exists L > 0 such that

$$P[\,|Y(x) - M(x)| \le L\,] = 1 \quad \text{and} \quad \lim_{x \to \infty} M(x) = M(\infty) > L:$$

If the conditions

$$M(x) \le c_1 x + d_1 \quad \text{if } x > \theta = 0,$$

$$M(x) \ge c_2 x + d_2 \quad \text{if } x < \theta = 0$$

are satisfied for positive constants $c_1$, $c_2$, $d_1$, $d_2$,
then



P[x^>e] £ e**"'^^ 



for any $\epsilon > 0$, where $\gamma = \gamma(\epsilon) > 0$.

For the case where $M(\infty) < L$ the rate of convergence
is much slower than for the previous case. Specifically

$$P[x_n > \epsilon] \le \exp\left(-n^{M(\infty) - \delta}\right)$$

for any $\delta > 0$, $n > n_0(\delta)$.

I. ASYMPTOTIC NORMALITY

The asymptotic behavior of the higher-order moments and
the asymptotic distribution of the random variables defined
by the sequence $\{x_n\}$ was first considered in detail by
Chung [Ref. 14]. His method is based on the moments of
$(x_n - \theta)$ and his results have been widely used in papers on
stochastic approximation. Chung's fundamental result is for
the "bounded case" and can be stated as follows:

THEOREM [Ref. 14]



Let M(x) be a Borel-measurable function that satisfies
the following conditions:

$$P\{\,|Y(x)| \le C_1\,\} = 1 \quad \text{for some constant } C_1,$$

$$(x - \theta)\{M(x) - \alpha\} > 0,$$

$$M(x) = \alpha + \alpha_1 (x - \theta) + o(|x - \theta|),$$

$$\inf_{|x - \theta| \ge \delta} |M(x) - \alpha| = K(\delta) > 0,$$

and

$$E[(Y(x) - M(x))^2] \le \sigma^2 < \infty \quad \text{for all } x.$$



Let a^^ = 1/n^”^ 



where 



1 

2 ( 1 + 02 ) 



< e 




and v.'here C 2 is determined by the condition 

|M(Xj^)-a| > C2en"^ I x^-e I 



Then for any integer $r \ge 1$

$$\lim_{n \to \infty} n^{r(1-\epsilon)/2} E[(x_n - \theta)^r] = \begin{cases} 0 & \text{if } r \text{ is odd}, \\ (\sigma^2 / 2\alpha_1)^{r/2} \, (r-1)(r-3) \cdots 3 \cdot 1 & \text{if } r \text{ is even}, \end{cases}$$

and the random variable $n^{(1-\epsilon)/2}(x_n - \theta)$ is asymptotically
normally distributed with mean zero and variance $= \sigma^2 / 2\alpha_1$.

A similar result was obtained by Chung for the quasi-linear
case (i.e., M(x) lies between two straight lines with
nonvanishing slope). In this case the boundedness assumption
is replaced by

$$K|x - \theta| \le |M(x) - \alpha| \le K_1 |x - \theta| \quad \text{for } K > 0, \ K_1 < \infty$$

and

$$E[\{Y(x) - M(x)\}^p] \le K_2 < \infty \quad \text{for } p \text{ an even integer}.$$

Then using $a_n = c/n$ where $c > 1/2K$, the distribution of
$n^{1/2}(x_n - \theta)$ tends to normal with mean zero and variance
$= \sigma^2 c^2 / (2\alpha_1 c - 1)$.

For the special case where M(x) is linear, Chung proves
asymptotic normality by using characteristic functions in a
very concise proof. While Hodges and Lehmann [Ref. 64]
improved some of Chung's results, Sacks [Ref. 95] utilized
a central limit theorem for dependent random variables to
obtain more general and more complete results about the
asymptotic normality of $x_n$. Below is a theorem of
Gladyshev [Ref. 57] that is a strengthened form of Sacks's
fundamental result.

THEOREM [Ref. 57]

Assume that the following conditions are satisfied:

$$\inf_{\epsilon < |x - \theta| < 1/\epsilon} (x - \theta)(M(x) - \alpha) > 0 \quad \text{for } \epsilon > 0,$$

$$M(x) = \alpha + \alpha_1 (x - \theta) + \delta(x, \theta)(x - \theta),$$

there exists a d > 0 such that for all x,

$$E[Y^2(x)] \le d(1 + x^2),$$

$$\lim_{x \to \theta} E[(Y(x) - M(x))^2] = \rho > 0,$$

$$\lim_{N \to \infty} \lim_{\epsilon \to 0} \sup_{|x - \theta| < \epsilon} E[(Y(x) - M(x))^2 \Phi_N(x)] = 0,$$

where

$$\Phi_N(x) = \begin{cases} 1 & \text{for } |Y(x)| > N, \\ 0 & \text{for } |Y(x)| \le N, \end{cases}$$

and $a_n = A n^{-1}$ is such that $A\alpha_1 > 1/2$.

Then the distribution of $n^{1/2}(x_n - \theta)$ tends to normal with
mean zero and variance $= A^2 (2A\alpha_1 - 1)^{-1} \rho$.



J. SELECTION OF STEP SIZE, $a_n$

As we have noted thus far, the sequence $\{a_n\}$ must
essentially have the same asymptotic behavior as the terms
of the harmonic series, 1/n, which satisfy the conditions

$$\sum_{n=1}^{\infty} a_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} a_n^2 < \infty.$$

We can intuitively see that the first condition is necessary
to guarantee that the sequence, $\{x_n\}$, does not get trapped
in any finite interval, while the second condition is necessary
for the convergence of the expected squared error term.
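
The two conditions on the step sizes can be illustrated numerically; a finite computation cannot prove divergence or convergence, so the sketch below is only indicative. The helper name `partial_sums` and the number of terms are assumptions for the example. For $a_n = 1/n$ the first partial sum keeps growing (it exceeds log N), while the sum of squares stays below $\pi^2/6$; a constant step size violates the second condition.

```python
import math


def partial_sums(step, n_terms):
    """Return (sum of a_n, sum of a_n^2) for n = 1, ..., n_terms."""
    total = 0.0
    total_sq = 0.0
    for n in range(1, n_terms + 1):
        a_n = step(n)
        total += a_n
        total_sq += a_n * a_n
    return total, total_sq


N = 100000
harmonic_sum, harmonic_sq = partial_sums(lambda n: 1.0 / n, N)
constant_sum, constant_sq = partial_sums(lambda n: 0.1, N)
print(harmonic_sum, math.log(N))       # grows without bound, like log N
print(harmonic_sq, math.pi ** 2 / 6)   # bounded by pi^2 / 6
print(constant_sq)                     # grows linearly in N: second condition fails
```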

However, it is reasonable to ask if there is a sequence,
$\{a_n\}$, which minimizes $E[(x_n - \theta)^2]$ after some fixed number
of observations, say N. Dvoretzky [Ref. 36] solved this
problem for the Robbins-Monro process.



THEOREM [Ref. 36]

Assume that a random variable, Y(x), satisfies the
condition

(i) $E[(Y(x) - M(x))^2] \le \sigma^2 < \infty$

and assume that M(x) is such that

(ii) $0 < A \le \dfrac{M(x) - \alpha}{x - \theta} \le B < \infty$

and if it is known that

(iii) $|x_1 - \theta| \le C \le \left[\dfrac{\sigma^2}{A(B - A)}\right]^{1/2}.$

Then if the sequence

$$a_n = \frac{A C^2}{\sigma^2 + n A^2 C^2}$$

is used in the Robbins-Monro process, the resultant

$$E[(x_n - \theta)^2] \le \frac{\sigma^2 C^2}{\sigma^2 + (n-1) A^2 C^2}$$

is obtained. The choice of $a_n$ here is optimal in the
minimax sense in that for any other choice of $a_n$ there
exist Y(x) and $x_1$ that satisfy conditions (i) and (iii) for
which the above bound on expected squared error does not
hold.

Now it is obvious that this information is of limited
use to the experimenter who has little a priori information
with which to choose $a_n$. Therefore for practical choice of
the sequence, $a_n$, the reader is directed to Section V.A,
where this problem is discussed.
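
A short numerical check makes the bound concrete. The sketch assumes the step sequence is read as $a_n = AC^2/(\sigma^2 + nA^2C^2)$ and the bound as $\sigma^2 C^2/(\sigma^2 + (n-1)A^2C^2)$, and it considers only the extreme linear case $M(x) - \alpha = A(x - \theta)$ with $|x_1 - \theta| = C$ and error variance exactly $\sigma^2$, for which the expected squared error obeys $b_{k+1} = (1 - A a_k)^2 b_k + \sigma^2 a_k^2$. The function names and parameter values are illustrative assumptions. In that case the recursion reproduces the bound exactly.

```python
def dvoretzky_step(n, A, C, sigma2):
    """Assumed reading of the step size a_n = A*C^2 / (sigma^2 + n*A^2*C^2)."""
    return A * C ** 2 / (sigma2 + n * A ** 2 * C ** 2)


def dvoretzky_bound(n, A, C, sigma2):
    """Assumed reading of the bound on E[(x_n - theta)^2]."""
    return sigma2 * C ** 2 / (sigma2 + (n - 1) * A ** 2 * C ** 2)


def extreme_case_mse(n, A, C, sigma2):
    """E[(x_n - theta)^2] when M is exactly linear with slope A and b_1 = C^2."""
    b = C ** 2
    for k in range(1, n):
        a_k = dvoretzky_step(k, A, C, sigma2)
        b = (1.0 - A * a_k) ** 2 * b + sigma2 * a_k ** 2
    return b


for n in (1, 2, 5, 50):
    print(n, extreme_case_mse(n, 0.5, 2.0, 1.5), dvoretzky_bound(n, 0.5, 2.0, 1.5))
```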



K. ACCELERATING CONVERGENCE 

When the initial guess, $x_1$, is far from the desired
value of $\theta$, the Robbins-Monro procedure approaches $\theta$ very
slowly because we are taking smaller and smaller steps.
Kesten [Ref. 69] proposed a method of accelerating the
convergence of a stochastic approximation algorithm based
on not decreasing the step size, $a_n$, if the difference
$(x_n - x_{n-1})$ has the same sign as $(x_{n-1} - x_{n-2})$, and decreasing
the step size if the signs differ, indicating that we
may be in the region of $\theta$. (Higher order schemes are also
proposed.)

It was shown that there exists a $\theta'$, not necessarily
identical with $\theta$, about which fluctuations in sign occur
more frequently in a finite number of trials. The value
of $x = \theta'$ is defined by the intersection of the line
$Y(x) = \alpha$ and the locus of medians of the densities
$\frac{\partial}{\partial Y} F(Y \mid x)$ for any x. If the density $\frac{\partial}{\partial Y} F(Y \mid x)$ is
symmetric, then $\theta' = \theta$. Even if the fluctuations occur
about a $\theta'$ different from $\theta$, $x_n$ still converges in
probability to $\theta$, as Kesten proved.
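
Kesten's rule can be sketched as follows. This is a simplified illustration, not Kesten's exact formulation: the step sequence $c/k$, the starting index and the function name are assumptions. The index k of the step size is advanced only when two successive differences $x_n - x_{n-1}$ and $x_{n-1} - x_{n-2}$ differ in sign.

```python
def kesten(observe, alpha, x1, n_steps, c=1.0):
    """Robbins-Monro iteration whose step c/k shrinks only after a sign change.

    Returns the final estimate and the final step index k.
    """
    k = 1
    x = x1 + (c / k) * (alpha - observe(x1))   # first step
    prev_diff = x - x1
    for _ in range(n_steps - 1):
        x_new = x + (c / k) * (alpha - observe(x))
        diff = x_new - x
        if diff * prev_diff < 0:   # direction reversed: we may be near the root
            k += 1                 # so move on to the next, smaller step size
        x, prev_diff = x_new, diff
    return x, k


# Noise-free illustration with a hypothetical response M(x) = 2 * (x - 3):
# the iterates approach the root from one side, so the step is never reduced.
x_final, k_final = kesten(lambda x: 2.0 * (x - 3.0), alpha=0.0, x1=100.0,
                          n_steps=10, c=0.4)
print(x_final, k_final)
```

With noisy observations near the root, sign changes become frequent and the step index advances at a rate comparable to the ordinary 1/n schedule.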

Authors such as Odell [Ref. 87], Sinha and Griscik
[Ref. 105], Sielken [Ref. 104], and Newbold [Ref. 86] have
presented accelerated stochastic approximation methods of
their own and have compared them with the original R-M
method and Kesten's method.

Another method of accelerating convergence was proposed
by Fabian [Ref. 40]. This method is an analog of the method
of steepest ascent (descent). Fabian proposed that the
step $a_n$ be determined in the following manner: for given $x_n$
and $y_n$ one makes a series of observations, $V_j$, (where the
observations are assumed to be independent of $x_n$ and $y_n$)
of the quantity $M(x_n + j a y_n)$ for j = 1, 2, ... until

$$\text{sign } V_1 = \cdots = \text{sign } V_{j-1} = \text{sign } V_j = -\text{sign } V_{j+1}.$$

Then choose $a_n = ja$. (Note here $\alpha = 0 = M(\theta)$.) Fabian proved that under
very general conditions on $V_j$ the iteration methods converge
with probability 1.

Authors who are interested in the practical or experimental
aspects of stochastic approximation have suggested
that the approximation method be carried out in two stages.
The first stage would take large steps to estimate the
region of interest, while the second stage would take
progressively smaller steps and represents the fine tuning
stage. (See Davis [Ref. 22], Wetherill [Ref. 130], and
Goodman, Lewis and Robbins [Ref. 58].)

L. CONFIDENCE INTERVALS AND STOPPING TIMES 

After k iterations it may be desired to obtain values
of $\gamma$ and $d$ such that

$$P(|x_{k+1} - \theta| < d) > 1 - 2\gamma.$$

Farrell [Ref. 50] did some of the first work on confidence
intervals of bounded length but required a priori knowledge
of a bounded interval containing $\theta$.

The subject of stopping times of a non-parametric nature
is an almost untouched area. Farrell stated that Mrs. Nancy
Tapper, Cornell University, had been studying closed stopping
rules and bounded length confidence interval procedures for
the median of a distribution function. However, very little
has appeared in the stochastic approximation literature
concerning stopping rules.

The most general discussion of stopping times based on
the asymptotic normality result was recently presented
by Sielken [Ref. 104] and is stated below.

Using the definition,

$$Z(x) = Y(x) - M(x),$$

consider the following conditions:

(1) $\gamma$ is a positive constant less than 1/2.

(2) The sequence of positive constants converges to a
    limit $c$ as $n \to \infty$, for some $0 < c < \infty$.

(3) The sequence $\{a_n\}$ has the form $An^{-1}$,
    where A is a constant such that $2A\alpha_1 > 1$.

(4) M is a Borel-measurable function.

(5) For $\varepsilon > 0$,
    $\inf_{\varepsilon \le x-\theta \le \varepsilon^{-1}} (M(x) - \alpha) > 0$
    and
    $\sup_{\varepsilon \le \theta-x \le \varepsilon^{-1}} (M(x) - \alpha) < 0$.

(6) For some constants $K_1$ and $K_2$,
    $|M(x) - \alpha| \le K_1 + K_2|x - \theta|$ for all x.

(7) $\sup_x E[|Z(x)|^2] < \infty$.

(8) $\lim_{x \to \theta} E[|Z(x)|^2] = E[Z(\theta)^2] = \sigma^2 > 0$.

(9) $\lim_{R \to \infty} \lim_{\varepsilon \to 0} \sup_{|x-\theta|<\varepsilon} \int_{|Z(x)|>R} |Z(x)|^2\,dP = 0$.

(10) For some positive constants $g$ and $\alpha_1$,
     if $|x - \theta| < g$,
     then $M(x) = \alpha + \alpha_1(x - \theta) + \delta(x)$,
     where $\delta(x) = o(|x - \theta|)$ as $|x - \theta| \to 0$.

(11) The distribution function of Y(x),
     denoted F(Y|x), is such that for every
     y, $F(y|\cdot)$ is Borel-measurable.

and

(12) There exists $\varepsilon > 0$ such that for every
     positive integer r,
     $\sup_{|x-\theta|<\varepsilon} E[|Z(x)|^r] < \infty$.

Then, assuming that a $100(1 - 2\gamma)\%$ confidence interval on
$\theta$ of length $2d$ is desired, the proposed stopping time for
the R-M process is denoted $N_{d,\gamma,1}$, where $N_{d,\gamma,1}$
is the smallest positive integer, n, such that

n > K ^ S ,^/(2At ^ -l)d^. 

— Y n,l 

The principal results of Sielken are:

THEOREM [Ref. 104]

If conditions (1) - (12) above are satisfied then

$$\lim_{d \to 0} N_{d,\gamma,1} \Big/ \left[ K_\gamma^2 A^2 \sigma^2 / \big( (2A\alpha_1 - 1)\,d^2 \big) \right] = 1,$$

with probability 1, and

$$\lim_{d \to 0} P\big(|x_{N_{d,\gamma,1}+1} - \theta| \le d\big) = 1 - 2\gamma.$$



Sielken has stated that the limit in the theorem can be
interpreted as either:

a. The level of the sequentially determined
   bounded length confidence interval converges
   to the prescribed level, $1 - 2\gamma$, as
   the desired length, 2d, converges to zero;
   or

b. The probability that the error in the final
   estimate of $\theta$ is less than or equal to d
   converges to the prescribed probability,
   $1 - 2\gamma$, as $d \to 0$.
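
The first limit in the theorem suggests a rough planning figure for the number of iterations. The sketch below simply evaluates the deterministic quantity $K_\gamma^2 A^2 \sigma^2 / ((2A\alpha_1 - 1)d^2)$ that the stopping time is compared against. It assumes that $K_\gamma$ is the upper $\gamma$ point of the standard normal distribution (the text does not define it), and that $\sigma$ and $\alpha_1$ are known, which in practice they are not. The function name and its arguments are illustrative only.

```python
from math import ceil
from statistics import NormalDist


def asymptotic_stopping_size(d, gamma, A, alpha1, sigma):
    """Evaluate K_gamma^2 A^2 sigma^2 / ((2 A alpha1 - 1) d^2), rounded up.

    Assumption: K_gamma is the upper-gamma quantile of N(0, 1).
    """
    if not 0.0 < gamma < 0.5:
        raise ValueError("gamma must lie in (0, 1/2), as in condition (1)")
    if 2.0 * A * alpha1 <= 1.0:
        raise ValueError("condition (3) requires 2*A*alpha1 > 1")
    k_gamma = NormalDist().inv_cdf(1.0 - gamma)
    value = k_gamma**2 * A**2 * sigma**2 / ((2.0 * A * alpha1 - 1.0) * d**2)
    return ceil(value)


if __name__ == "__main__":
    # 95% interval (gamma = 0.025) of half-width d = 0.1
    print(asymptotic_stopping_size(d=0.1, gamma=0.025, A=1.0, alpha1=1.0, sigma=1.0))
```

Halving d multiplies the figure by about four, which reflects the $d^{-2}$ dependence in the theorem.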

M. DYNAMIC STOCHASTIC APPROXIMATION 

Fabian [Ref. 39] and Dupac [Ref. 34] have considered the
case where the desired level, $\theta$, changes during the
iteration process. The following discussion is by Fu
[Ref. 53], based on Dupac's presentation.

Let $M_n(x) = M(x - \theta_n + \theta_1)$ such that $\theta_n$ is the
unique root of $M_n(x) = 0$. Let $\{a_n\}$ be a sequence of
positive numbers, and let $x_1$ be an arbitrary random variable.

Define:

$$x_{n+1} = x_n' - a_n Y(x_n'),$$

where

$$E[Y(x_n') \mid x_n] = M_{n+1}(x_n')$$

and

$$\operatorname{var}[Y(x_n') \mid x_1, \ldots, x_n] \le \sigma^2 < +\infty.$$

The meaning of the above algorithm, which computes with
the modified $x_n$, i.e. $x_n'$, is that when we get an estimate,
$x_n$, of $\theta_n$, we make a correction for trend to obtain
$x_n'$ before computing $x_{n+1}$. It will be seen from the
following theorem that the use of this modified algorithm is
justified when $\theta_n$ is a linear (or nearly linear) function
of n.

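THEOREM [Ref. 34]

Assume that the following conditions are satisfied:

(i)   $M(x) < 0$ for $x < \theta_1$ and $M(x) > 0$ for $x > \theta_1$.

(ii)  There exist $K_0$, $K_1$ such that
      $K_0|x - \theta_1| \le |M(x)| \le K_1|x - \theta_1|$ for all x.

(iii) $a_n = a/n^{\alpha}$, for $a > 0$, $1/2 < \alpha < 1$.

(iv)  $\theta_n$ varies in such a way that
      $\theta_{n+1} - (1 + n^{-1})\theta_n = O(n^{-\omega})$ for $\omega > \alpha$.

(v)   $E(x_1^2) < +\infty$.

Then $(x_n - \theta_n)$ approaches zero in the mean and

$$E(x_n - \theta_n)^2 = \begin{cases} O(n^{-\alpha}) & \text{for } 1/2 < \alpha \le 2/3, \\ O(n^{-2(1-\alpha)}) & \text{for } 2/3 < \alpha < 1. \end{cases}$$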

The mean square convergence, as well as convergence with
probability 1, can be deduced from Dvoretzky's theorem, even
under slightly more general conditions on $\theta_n$. A similar
modification to the Kiefer-Wolfowitz procedure is indicated
to solve for a moving maximum of a regression function.

An interesting algorithm is presented in Fu's book
[Ref. 53] for learning of slowly time varying parameters
using dynamic stochastic approximation. Here Kesten's
accelerated scheme [Ref. 69] is coupled with Dupac's dynamic
process to improve the rate of convergence.
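
The effect of the trend correction can be illustrated with a small simulation. This is only a sketch under stated assumptions: the correction is taken to be $x_n' = (1 + 1/n)x_n$, which is the form suggested by condition (iv) rather than something spelled out above; the regression function is taken to be linear with unit slope, so that an observation at $x_n'$ has mean $x_n' - \theta_{n+1}$; and the noise is Gaussian. All names (`dynamic_rm`, `theta`, `correct_trend`) are hypothetical.

```python
import random


def dynamic_rm(theta, n_steps, a=1.0, alpha=0.75, noise_sd=1.0,
               x1=0.0, correct_trend=True, seed=0):
    """Track a moving root theta(n) with a_n = a / n**alpha, 1/2 < alpha < 1.

    Assumed model: Y(x) observed at step n has mean x - theta(n + 1).
    Returns the estimate x_{n_steps + 1} of theta(n_steps + 1).
    """
    rng = random.Random(seed)
    x = x1
    for n in range(1, n_steps + 1):
        # Trend correction (assumed form): x_n' = (1 + 1/n) x_n
        x_corr = (1.0 + 1.0 / n) * x if correct_trend else x
        y = (x_corr - theta(n + 1)) + rng.gauss(0.0, noise_sd)
        x = x_corr - (a / n**alpha) * y
    return x


if __name__ == "__main__":
    drift = lambda n: 0.5 * n  # root moving linearly in n
    steps = 5000
    for flag in (True, False):
        est = dynamic_rm(drift, steps, correct_trend=flag)
        print(flag, abs(est - drift(steps + 1)))
```

With a linearly moving root the corrected iteration stays close to the target, while the uncorrected Robbins-Monro iteration lags further and further behind as the step sizes shrink.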

N. CONTINUOUS STOCHASTIC APPROXIMATION 

In order to obtain a continuous version of the stochastic
approximation method, one can replace the difference recursive
relation in the discrete case with a stochastic differential
equation. Again letting the desired level of response, $\alpha$,
be equal to zero, one obtains the general expression

$$\frac{dX(t)}{dt} = -a(t)\,Y(t, X(t)),$$

where a(t) satisfies the conditions

$$\int_0^\infty a(t)\,dt = \infty \quad \text{and} \quad \int_0^\infty a^2(t)\,dt < \infty.$$

The above relation determines a continuous process for
stochastic approximation of the solution to the equation
M(x) = 0. Driml and Nedoma [Ref. 31] proved that the process
converges when Y(t,x) is monotonic in x and when Y(t,x) is
of the form Y(t,x) = M(x) + h(t), where h(t) is an ergodic
process with zero mean. In both cases the function X(t)
approaches the desired value, $\theta$, with probability 1 as
$t \to \infty$. In the proof by Driml and Nedoma, a(t) is taken
to equal 1/t for t > 1.



O. EXTENSIONS OF CONTINUOUS STOCHASTIC APPROXIMATION

As in the discrete case, the one dimensional continuous
case can be extended to the multidimensional
case. However, many theorems which are valid for the one
dimensional case are not valid for the multidimensional case,
which depends heavily on stationary point theorems (i.e.
theorems concerning a point $x_0$ of some space X for which
$F(x_0) = x_0$, where F maps X into X). For a discussion of these
theorems see Driml and Hans [Ref. 30] and Hans and Spacek
[Ref. 61].

One representation using continuous stochastic approximation
is by Kitagawa [Ref. 71], who formulated a Robbins-Monro
model where the Brownian motion process is used to represent
the random disturbances inherent in the observations.






IV. FINDING THE MAXIMUM OF AN UNKNOWN REGRESSION
FUNCTION: THE KIEFER-WOLFOWITZ METHOD



A problem of practical importance with a regression
function, Y(x), is to estimate the value of x, say $\theta$, at
which the expectation of Y(x), denoted M(x), is a maximum.
To introduce the method intuitively, consider the following
argument from Wetherill [Ref. 131].

[Figure 1: three panels, (a), (b) and (c), each plotting Y(x)]

Suppose two observations, $y(x_1)$ and $y(x_2)$, are taken at
values $x_1$ and $x_2$ where $x_1 < x_2$. Then

(a) If $y(x_1) < y(x_2)$ one expects the maximum
    level, $\theta$, to be at a value $\ge x_2$.

(b) If $y(x_1) > y(x_2)$ one expects the maximum
    level, $\theta$, to be at a value $\le x_1$.

(c) If $y(x_1)$ is about equal to $y(x_2)$, further
    observations are necessary to determine the
    region of interest.



Thus it would be reasonable to take further observations
in the direction indicated by the slope of the two Y values,
and the distance moved along the x-axis, before taking
further observations, should be proportional to the difference
between $y(x_1)$ and $y(x_2)$. Using this basic idea and the
initial results of Robbins and Monro, Kiefer and Wolfowitz
[Ref. 70] defined the following procedure for stochastic
approximation of the maximum of a regression function.

THEOREM [Ref. 131]

Let M(x) be a regression function and F(Y|x) a family
of distribution functions, and assume that

$$\int (Y(x) - M(x))^2\,dF(Y|x) \le \sigma^2 < +\infty,$$

and that M(x) is strictly increasing for $x < \theta$ and
strictly decreasing for $x > \theta$.

Let $\{a_n\}$ and $\{c_n\}$ be infinite sequences of positive
real numbers such that

$$c_n \to 0, \qquad \sum_{n=1}^{\infty} a_n = \infty, \qquad \sum_{n=1}^{\infty} a_n c_n < \infty, \qquad \sum_{n=1}^{\infty} a_n^2 c_n^{-2} < \infty$$

(for example: $a_n = n^{-1}$ and $c_n = n^{-1/3}$). Then the
recursive scheme defined by

$$x_{n+1} = x_n + \frac{a_n}{c_n}\,[\,Y(x_n + c_n) - Y(x_n - c_n)\,]$$

converges in probability to the maximum, $\theta$, of the
regression function M(x) if three regularity conditions are
satisfied. They are listed here with their intuitive meanings.

Condition 1.* There exist positive $\beta$ and B such that
$|x_1 - \theta| + |x_2 - \theta| < \beta$ implies
$|M(x_1) - M(x_2)| < B|x_1 - x_2|$ for all $x_1, x_2$. This
says that if the function M(x) has a derivative, it must be
zero when $x = \theta$; as a result the derivative must be
bounded in the neighborhood of $\theta$.

Condition 2. There exist positive $\rho$ and R such that
$|x_1 - x_2| < \rho$ implies $|M(x_1) - M(x_2)| < R$. In other
words, if M(x) increases too abruptly in certain regions,
there exists a positive probability that it may reach
$+\infty$ or $-\infty$; as a result, the Lipschitz condition must
be satisfied.

Condition 3. For every $\delta > 0$, there exists a positive
$\pi(\delta)$ such that $|x - \theta| > \delta$ implies

$$\inf_{\delta/2 > \varepsilon > 0} \frac{|M(x + \varepsilon) - M(x - \varepsilon)|}{\varepsilon} > \pi(\delta).$$

Thus if M(x) is a very flat function the rate of motion
toward $\theta$ is small. As a result, the absolute value of the
derivative must be bounded below.

* As Blum later proved [Ref. 8], the above theorem
holds even when Condition 1 is not satisfied.

While these regularity conditions seem restrictive, it
is only necessary that they hold in an interval $[C_1, C_2]$
where it is known a priori that $C_1 \le \theta \le C_2$. Suppose,
however, that some proposed level, $x_n \pm c_n$, lies outside the
interval $[C_1, C_2]$ and one cannot take an observation at that
level. If one then moves $x_n$ so that the offending
$x_n \pm c_n$ is at the boundary ($C_1$ or $C_2$), we may proceed
as directed and the conclusions remain valid.
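
The recursion of the theorem is short enough to sketch directly. The following is a minimal illustration, not a reference implementation: it uses the example sequences $a_n = n^{-1}$ and $c_n = n^{-1/3}$ from the theorem, and the test response (a noisy concave quadratic with maximum at 2) is an assumption made here purely for demonstration. The name `observe` stands for whatever experiment returns a noisy value of Y(x).

```python
import random


def kiefer_wolfowitz(observe, x1, n_steps):
    """Kiefer-Wolfowitz iteration with a_n = 1/n and c_n = n**(-1/3).

    observe(x) must return one noisy observation Y(x) with mean M(x).
    Two observations are taken per step, at x_n + c_n and x_n - c_n.
    """
    x = x1
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n
        c_n = n ** (-1.0 / 3.0)
        # x_{n+1} = x_n + (a_n / c_n) [Y(x_n + c_n) - Y(x_n - c_n)]
        x = x + (a_n / c_n) * (observe(x + c_n) - observe(x - c_n))
    return x


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical response: M(x) = -(x - 2)^2 plus Gaussian noise
    noisy_response = lambda x: -(x - 2.0) ** 2 + rng.gauss(0.0, 0.5)
    print(kiefer_wolfowitz(noisy_response, x1=0.0, n_steps=5000))
```

The boundary adjustment described above (moving $x_n$ so that $x_n \pm c_n$ stays inside $[C_1, C_2]$) is omitted to keep the sketch short.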



A. CONSTANT COEFFICIENTS 

Burkholder [Ref. 12] proved that under certain conditions
the Kiefer-Wolfowitz procedure can still be used if $c_n$ is
held constant for all n at a particular value, $c_0$. $x_n$ is
then asymptotically normally distributed with variance
proportional to $n^{-1}$. This result is difficult to use in
practice since there will rarely be enough information about
the response curve to choose $c_0$ as required by Burkholder.

B. CONVERGENCE WITH PROBABILITY 1 

The Kiefer-Wolfowitz process is a special case of the
Dvoretzky process (i.e. the process can be written as the
sum of a deterministic term and an error term). This can
be seen by writing

$$x_{n+1} = x_n + \frac{a_n}{c_n}\,[\,M(x_n + c_n) - M(x_n - c_n)\,] + z_n,$$

where the error term is

$$z_n = \frac{a_n}{c_n}\,[\,Y(x_n + c_n) - M(x_n + c_n) - Y(x_n - c_n) + M(x_n - c_n)\,].$$

It follows from a theorem by Dvoretzky that the Kiefer-Wolfowitz
procedure converges with probability 1 and in
mean square under conditions weaker than those imposed by
Kiefer and Wolfowitz. Burkholder [Ref. 12] also proved
convergence with probability 1 using a somewhat different
approach. Later Venter [Ref. 122] showed that the K-W
method converges almost surely to the maximum if this is the
only stationary point of the surface and some other conditions
are satisfied. This result is stronger, in a sense,
than those existing previously.

C. MULTIDIMENSIONAL KIEFER-WOLFOWITZ

Let $(X_1, \ldots, X_N)$ be a family of random variables; let
$F_{x_1, \ldots, x_N}$ be the corresponding distribution function;
and let $M(x_1, \ldots, x_N)$ be the corresponding regression
function. We then desire to find a vector $x = \theta$ for which
the regression function is maximal. Assume that M(x) has a
unique maximum at the point $x = \theta$.

Blum [Ref. 7] constructed a multidimensional K-W process
in the following manner. Let $x \in R^N$ and let $(e_1, \ldots, e_N)$
be an orthonormal basis in $R^N$. Then for some real c > 0, we
make N + 1 observations of the random variable $Y(\cdot)$,

$$Y(x),\; Y(x + ce_1),\; Y(x + ce_2),\; \ldots,\; Y(x + ce_N),$$

and consider the vector

$$Y_{x,c} = [\,\{Y(x + ce_1) - Y(x)\}, \ldots, \{Y(x + ce_N) - Y(x)\}\,].$$

Then, beginning with some arbitrary vector, $x_1$, construct the
sequence

$$x_{n+1} = x_n + \frac{a_n}{c_n}\,Y(x_n),$$

where $Y(x_n)$ denotes $Y_{x_n, c_n}$. Denote the vector of first
derivatives of M(x) by D(x), and the matrix of second
derivatives by A(x). Then the following theorem by Blum is
presented:

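THEOREM [Ref. 7]

Let $\{a_n\}$ and $\{c_n\}$ be sequences of positive real
numbers that satisfy:

$$c_n \to 0, \qquad \sum_{n=1}^{\infty} a_n = \infty, \qquad \sum_{n=1}^{\infty} a_n c_n < \infty, \qquad \sum_{n=1}^{\infty} a_n^2 c_n^{-2} < \infty.$$

Moreover, assume that Y(x) and M(x) are such that
$M(\cdot)$ is continuous together with its first and second
derivatives, and for any $\varepsilon > 0$ there exists a
$\rho(\varepsilon) > 0$ such that $\|x\| \ge \varepsilon$ implies that

$$M(x) \le -\rho(\varepsilon), \quad \text{and} \quad \|D(x)\| \ge \rho(\varepsilon),$$

where the partial derivatives $\partial^2 M(x)/\partial x_i \partial x_j$
are bounded for all i, j = 1,...,N.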

Then the sequence $\{x_n\}$ previously defined converges
to $\theta = 0$ with probability 1. Note that each step in Blum's
algorithm requires N + 1 observations. Gray [Ref. 59]
proved that the multidimensional K-W process defined by

$$x_{n+1} = x_n + \frac{a_n}{c_n}\,[\,Y^{+}_{x_n, c_n} - Y^{-}_{x_n, c_n}\,]$$

also converges with probability one, where

$$Y^{\pm}_{x_n, c_n} = \{\,Y(x_n \pm c_n e_1), \ldots, Y(x_n \pm c_n e_N)\,\},$$

which requires 2N observations in each step.
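
As an illustration of Blum's one-sided scheme (N + 1 observations per step), the sketch below applies the recursion coordinate by coordinate. It is a minimal example under assumptions chosen here: the sequences $a_n = n^{-1}$ and $c_n = n^{-1/3}$, the standard basis as the orthonormal basis, and a hypothetical noisy response with maximum at the origin, in line with the theorem's normalization $\theta = 0$.

```python
import random


def blum_kw(observe, x1, n_steps):
    """Blum's multidimensional K-W iteration with one-sided differences.

    observe(x) returns a noisy scalar observation Y(x) for a list x.
    Each step uses N + 1 observations: Y(x) and Y(x + c_n e_i), i = 1..N.
    """
    x = list(x1)
    dim = len(x)
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n
        c_n = n ** (-1.0 / 3.0)
        y_center = observe(x)
        diffs = []
        for i in range(dim):
            shifted = x.copy()
            shifted[i] += c_n          # x + c_n e_i
            diffs.append(observe(shifted) - y_center)
        # x_{n+1} = x_n + (a_n / c_n) Y_{x_n, c_n}
        x = [xi + (a_n / c_n) * di for xi, di in zip(x, diffs)]
    return x


if __name__ == "__main__":
    rng = random.Random(2)
    # Hypothetical response: M(x) = -||x||^2 plus Gaussian noise
    response = lambda v: -sum(t * t for t in v) + rng.gauss(0.0, 0.2)
    print(blum_kw(response, [1.0, -1.0], 5000))
```

Gray's symmetric version would replace the center observation by observations at $x_n - c_n e_i$, at the cost of 2N observations per step.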
D. ASYMPTOTIC PROPERTIES OF K-W PROCESS

The first results concerning asymptotic properties of
the Kiefer-Wolfowitz process were obtained by Derman [Ref. 24]
and Dupac [Ref. 32], based on the lemmas of Chung [Ref. 14].
Sacks [Ref. 95] has discussed conditions for asymptotic
normality of $x_n$. If $c_n$ is chosen to tend to zero, then the
asymptotic variance of $x_n$ can never be made as small, in
order of magnitude, as Burkholder's result of being proportional
to $n^{-1}$ with $c = c_0$ a constant. The most general
results without a priori assumptions about the length of
the interval containing the point $x = \theta$ have been obtained
by Sacks.

THEOREM [Ref. 95]

Let M(x) be a measurable function with a unique maximum
at $x = \theta$, and assume that this function satisfies the
conditions:

(i)   $\inf (x - \theta)\,(M(x - \varepsilon) - M(x + \varepsilon)) > 0$,
      the infimum being taken over $0 < \varepsilon \le \varepsilon_0$ and
      $\varepsilon_1 \le |x - \theta| \le \varepsilon_2$,
      where $0 < \varepsilon_0 < \varepsilon_1 < \varepsilon_2 < \infty$;

(ii)  for all x, $M(x) = -\alpha_1(x - \theta)^2 + \delta(x, \theta)$,
      where $\alpha_1 > 0$ and $\delta(x, \theta) = o(|x - \theta|^2)$
      as $|x - \theta| \to 0$;

(iii) for some $c_0 > 0$, there exist positive
      constants $K_1$ and $K_2$ such that for all x
      and all c for which $0 < c \le c_0$,
      $K_1(x - \theta)^2 \le (x - \theta)\,[M(x - c) - M(x + c)]\,c^{-1} \le K_2(x - \theta)^2$;

(iv)  for every $\varepsilon > 0$ there exists a $c_\varepsilon > 0$ such
      that for all c satisfying $0 < c \le c_\varepsilon$ and
      all x satisfying $|x - \theta| < c_\varepsilon$,
      $|\delta(x - c, \theta) - \delta(x + c, \theta)|\,c^{-1} \le \varepsilon|x - \theta|$.

Further assume

$$\lim_{x \to \theta} E[\{Y(x) - M(x)\}^2] = \sigma^2/2$$

and

$$\lim_{R \to \infty} \lim_{\varepsilon \to 0} \sup_{|x - \theta| < \varepsilon} \int_{|Y(x) - M(x)| > R} (Y(x) - M(x))^2\,dP = 0.$$

Then if $a_n = An^{-1}$, where $A > 1/2K_1$, the random variable
$n^{1/2} c_n (x_n - \theta)$ is asymptotically normally distributed
with

    Mean = 0
    Variance = $\sigma^2 A^2 (8\alpha_1 A - 1)^{-1}$

Sacks, in the same paper, also gave the corresponding asymptotic
limiting distribution for the multidimensional K-W process.

E. MAXIMUM SAMPLE EXCURSIONS IN KIEFER-WOLFOWITZ PROCESS

When we seek a maximum or minimum using the Kiefer-Wolfowitz
process, the possibility arises that we may be
working with a function with more than one local maximum, or
that we do not want to reduce the performance, M(x), below
some minimum level. The value of x corresponding to this
level may not be known. In both of these cases we may wish
to limit the excursions to some given multiple or function of
$|x_1 - \theta|$, with a high probability, while still being certain
that $x_n \to \theta$ with probability 1. To accommodate this
situation Kushner [Ref. 79] presented estimates of the
following form:

For any $m < \infty$ and even integer r,

$$P\Big[\max_{m \le n < \infty} |x_n - \theta| \ge \varepsilon\Big] \le [\,E(x_m - \theta)^r + \delta_{m,r}\,]/\varepsilon^r,$$

where $\delta_{m,r}$ depends on the sequences $a_n$ and $c_n$ and can be
made arbitrarily small for each fixed m and r, while
$x_n \to \theta$ with probability 1 is still ensured.

F. ACCELERATED CONVERGENCE FOR THE K-W PROCESS 

As in the case of the Robbins-Monro process, the rate
of convergence of the K-W process can be increased by using
Kesten's algorithm [Ref. 69] (see Sec. III.J). Another
method for accelerating convergence was proposed by Fabian
[Ref. 40], who later showed [Ref. 45] that the multidimensional
K-W procedure for functions, f, sufficiently smooth at $\theta$,
the point of minimum (or maximum), can be modified in such
a way as to be almost as speedy as the R-M method. This
modification consists of making more observations at every
step and of utilizing these to eliminate the effect of all
derivatives $\partial^j f/\partial x^j$, $j = 3, 5, 7, \ldots, s-1$. Let
$\delta_n$ be the distance from the approximated $\theta$ after n
observations. Under similar conditions on f as those used by
Dupac [Ref. 32] the result $E\delta_n^2 = O(n^{-s/(s+1)})$ can be
obtained. Under weaker conditions it was proved that
$\delta_n^2\, n^{s/(s+1) - \varepsilon} \to 0$ with probability 1 for every
$\varepsilon > 0$.

In a follow-up paper Fabian [Ref. 46] noted that there
are many designs, d, which achieve the speed of $E\delta_n^2$ as
stated above. He derived the dependence of

$$\lim_{n \to \infty} n^{s/(s+1)}\,E\delta_n^2$$

on d, so that one may choose the design which minimizes
$E\delta_n^2$.

In yet a third paper in this series, Fabian [Ref. 48]
utilized a design which minimizes $E\delta_n^2$ and obtained a
result for $E\|x_n - \theta\|^2$ expressed in terms of $t_n$, where
$t_n$ equals the number of observations necessary to
construct $x_1, x_2, \ldots, x_n$.

G. THE CONTINUOUS KIEFER-WOLFOWITZ PROCESS 

As with the Robbins-Monro method, we have a continuous
analog of the Kiefer-Wolfowitz method. Let us consider a
method, as discussed in Loginov's survey [Ref. 81], for an
ergodic random process $Y_t$. Let x denote an N-dimensional
vector with coordinates $x^i$ in N-dimensional Euclidean
space with orthonormal basis $e_1, \ldots, e_N$. Then the regression
function is $M(x) = E[Y_t(x)]$. Moreover assume that

$$y_t^i[x, c(t)] = y_t[x + c(t)e_i] - y_t[x - c(t)e_i],$$

where c(t) is some positive function. Then the continuous
K-W method of determining a minimum point for a regression
function is described by the equation

$$\frac{dx_t^i}{dt} = -a(t)\,I_{i,t}\,c^{-1}(t)\,y_t^i[x_t, c(t)]$$

with initial conditions $x_0^i = x^i(0)$, for i = 1,2,...,N,

where 






Here $G_i^+$ is a monotonic function with derivative bounded on
$[b_i - \delta, b_i]$ and

$$G_i^+(x) = \begin{cases} 0 & \text{for } x \le b_i - \delta, \\ 1 & \text{for } x = b_i, \end{cases}$$

and $G_i^-$ is a monotonic function with derivative bounded on
$[a_i, a_i + \delta]$ and

$$G_i^-(x) = \begin{cases} 0 & \text{for } x \ge a_i + \delta, \\ 1 & \text{for } x = a_i, \end{cases}$$

and $F^+(y) = 1 - y/\varepsilon c(t)$ and $F^-(y) = 1 + y/\varepsilon c(t)$
for $\varepsilon > 0$.

The essential difference between the original discrete
Kiefer-Wolfowitz method and this continuous version is the
fact that here the observations need not be independent, as
they were in the discrete case. The term $I_{i,t}$ serves to
limit the variable $x^i$ to the interval $[a_i, b_i]$.

Sakrison proved the following convergence theorem for
the continuous K-W process.



THEOREM [Ref. 92]

Represent $y_t$ in the form

$$y_t(x) = \sum_{j=1}^{N} g_j(x)\,V_{j,t},$$

where $V_{j,t}$ are ergodic random processes that are bounded
with probability one, while $g_j(\cdot)$ are functions whose
second partial derivatives with respect to $x^i$ are bounded.

Now let denote any of the random processes 

or , , V , , (e, m = 1,2,...,N). Moreover let F, be ar / 

e,t+p m,t+p 5 » 5 j ^ j 

bounded functional defined on the processes V 

e , X 

and Bpp(p) = M{(F^ - M(F^))(D^^p - M(D^^p))} be such that 






where < +-». Assume that the regression function, M(x) 

satisfies the conditions 
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(grad M(z) I^^^,x-6) ^ Kp ||x-9|! , 



I I grad M(x) I I ^ - ^3 ^ ^ 



for 0 < Kp < < +~, 



9^M 



9X^- 



< P 



for 1 = 1,2,...,N. Then if the relations 



/ a(t)dt = ~ 



/ a(t) c‘^(t)dt < 



/ a(t) a(l/2t)dt < " 



/ a(t) c~^(t)dt < 



hold for the functions a(t) and c(t), the solution of the
stochastic differential equation converges to $\theta$ in mean
square, i.e.

$$\lim_{t \to \infty} E\{\|x_t - \theta\|^2\} = 0.$$

An example of functions satisfying the above conditions 



are 
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a(t) = 



a 



and 



c(t ) 



c 




(t+ 1 )^ 



whe re 



J 5 < a < 1 



and Y > ^(1 - «)• 



Example (by Sakrison [Ref. 92]). 



If a = 1 and Y 




then 



E{| ix^ - e| 1^} = Od/n'^). 



It is not difficult to see that in the continuous case
the requirements of the theorems are considerably more
stringent than those in the discrete case. Here constraints
are imposed on the process itself, not just on the regression
function. This is the fundamental difference between
discrete and continuous stochastic approximation methods.

V. PRACTICAL ASPECTS

A. CHOICE OF $a_n$

In Section III.J. a theorem of Dvoretzky [Ref. 36] was
presented giving a formulation for the sequence, $a_n$, which
is optimal in the minimax sense. However, this formulation
contains parameters which will in general be unknown to the
experimenter. The need then arises for a method of choosing
a sequence.

Hodges and Lehmann [Ref. 64] recommended using coefficients
of the form $a_n = c/n$, where c is chosen to minimize the
asymptotic variance, $\sigma^2 c^2/(2\alpha_1 c - 1)$. This leads to
choosing $c = 1/\alpha_1$, where $\alpha_1$ is the slope of the response
function, M(x), at the desired level $x = \theta$ (i.e. choose
$c = 1/\alpha_1$, where $\alpha_1 = M'(\theta)$). This does not reduce the
experimenter's dilemma since it requires a priori estimation
of another unknown parameter. It does, however, provide a
basis for sensitivity analysis on expected squared error
based on changes in the multiplier, c, in terms of $\alpha_1$.

Computer simulations were performed by Hodges and Lehmann
[Ref. 64] and by Wetherill [Ref. 130] with very similar
results.

In general, choosing $c \le 1/2\alpha_1$ should be avoided since
the asymptotic behavior is unknown and simulation experiments
indicate that large biases exist when c is chosen to be too
small. Similarly, when c is chosen too large the asymptotic
variance increases; however, it increases slowly for $c\alpha_1 > 1$.
Thus when the value of $\alpha_1$ is unknown it would be more
desirable to overestimate the value of c than to
underestimate c.

In the special case where M(x) is linear it is easily
shown that $a_n = c/n$, with $c = 1/M'(\theta)$, is a desirable
choice. Consider the case M(x) = bx, where it is desired to
sequentially arrive at the value of x where M(x) = 0.
Without loss of generality let $\theta = 0$. Thus the value of
$x = \theta$ for which M(x) = 0 is $\theta = 0$. Choose c = 1/b, noting
that b is the slope of the response function. Then for any
initial value, $x_1$, the expected value of $x_2$ can be easily
computed since

$$x_2 = x_1 - \frac{1}{b}\,\{Y(x_1) - 0\}$$

implies that

$$E(x_2) = x_1 - \frac{1}{b}\,E\{Y(x_1)\},$$

where

$$E\{Y(x_1)\} = M(x_1) = bx_1.$$

Hence

$$E(x_2) = x_1 - x_1 = 0,$$

the desired $\theta$.

Thus in the linear case the correct choice of c will
move the estimate to the neighborhood of $\theta$ early in the
process, as evidenced by the fact that the first step
actually produces an unbiased estimate.

B. ESTIMATING THE SLOPE TO IMPROVE ASYMPTOTIC VARIANCE 

Wetherill [Ref. 130] noted that in the simple case where
M(x) is a linear function and the sequence $a_n = c/n$ is used,
the choice of c is critical to the efficiency of the process,
where efficiency is defined as the reciprocal of the ratio
of the variance for a given c to the variance at c = M'(θ).
See Table 1 (also see Hodges and Lehmann [Ref. 64]).

                          TABLE 1

         Asymptotic Efficiency of the Robbins-Monro
              Process as a Function of c/M'(θ)

c/M'(θ)      0.50   0.75   1.00   1.25   1.50   2.00   2.50

efficiency   0      0.88   1.00   0.96   0.88   0.75   0.64

Table 1 shows that there is a large range of c for which
the process is very efficient, with c = M'(θ) being optimal.
It also implies that it is better to overestimate the
value of c than to underestimate.
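
The entries of Table 1 can be checked against the asymptotic variance $\sigma^2 c^2/(2\alpha_1 c - 1)$ quoted in Section V.A. The sketch below reads the tabulated ratio as c relative to its variance-minimizing value (an interpretation adopted here, writing r for that ratio), in which case the efficiency is $(2r - 1)/r^2$. The function name is illustrative.

```python
def rm_efficiency(ratio):
    """Asymptotic efficiency of the R-M process for a_n = c/n.

    ratio: c divided by the variance-minimizing c (assumed reading of
    the first row of Table 1). Efficiency is the optimal variance
    divided by the variance at the given c, i.e. (2r - 1) / r^2.
    """
    if ratio <= 0.5:
        # The asymptotic variance formula breaks down at or below r = 1/2.
        return 0.0
    return (2.0 * ratio - 1.0) / ratio**2


if __name__ == "__main__":
    for r in (0.50, 0.75, 1.00, 1.25, 1.50, 2.00, 2.50):
        print(f"{r:4.2f}  {rm_efficiency(r):.3f}")
```

The computed values agree with the table to within the two-decimal precision shown there, and the slow decline for ratios above 1 is the basis for the advice to err on the side of a larger c.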
Burkholder [Ref. 12] discussed the possibility of
estimating the slope of M at $\theta$, but this procedure was not
investigated further until Venter [Ref. 121] presented an
extension of the Robbins-Monro procedure which estimates
the slope of the regression function at the root. The
method is similar to the Kiefer-Wolfowitz procedure in that
at each step two observations are taken, namely
$Y' = Y(x_n + c_n)$ and $Y'' = Y(x_n - c_n)$, where
$c_n = cn^{-\gamma}(1 + o(1))$, $c > 0$, $0 < \gamma < 1/2$. Venter
required that we know constants a and b such that
$0 < a < M'(\theta) < b < \infty$. At each step he estimated the
slope by a quantity $B_n$ and then kept the estimated slope
within the established bounds by using $A_n$ as the estimate
of the slope, where

$$A_n = \begin{cases} a & \text{if } B_n < a, \\ B_n & \text{otherwise}, \\ b & \text{if } B_n > b. \end{cases}$$
Venter then defined the recursive relation 




where 




-1 



(1 + 0(n"*^)) 
Venter showed that if, in the choice of $c_n$,
$1/4 < \gamma < 1/2$, then

$$n^{1/2}(x_n - \theta) \xrightarrow{L} N\!\left(0,\; \frac{\sigma^2}{2(M'(\theta))^2}\right)$$

and

$$n^{1/2 - \gamma}(A_n - M'(\theta)) \xrightarrow{L} N\!\left(0,\; \frac{\sigma^2}{2c^2(1 + 2\gamma)}\right).$$

However, if $\gamma = 1/4$ then

$$n^{1/2}(x_n - \theta) \xrightarrow{L} N\!\left(\frac{-2\alpha_2 c^2}{M'(\theta)},\; \frac{\sigma^2}{2(M'(\theta))^2}\right)$$

and

$$n^{1/4}(A_n - M'(\theta)) \xrightarrow{L} N\!\left(0,\; \frac{\sigma^2}{3c^2}\right).$$

Venter stated that in the case of $\gamma < 1/4$ the bias in
the estimate, $x_n$, of $\theta$ will dominate the error. Therefore
the choice of $\gamma = 1/4$ gives a small negative bias but
decreases the variance in the estimate of the slope.

One might ask whether this modified procedure is actually
at a disadvantage since it requires two observations per
step. Venter showed that after n steps (2n observations)
its variance still achieves the minimum value of the
old Robbins-Monro procedure after 2n steps (2n observations).
Venter also provided an estimate of $\sigma^2$ so that confidence
intervals could be constructed for his procedure.

Fabian [Ref. 47] later provided a sophisticated proof of
asymptotic normality of Venter's procedure and of a similar
procedure applied to the Kiefer-Wolfowitz method.

C. SMALL SAMPLE THEORY 

Considering the practical applications of stochastic
approximation in experiments where infinite quantities
of test items may not be available, it is justifiable to
ask how small sample realizations compare with asymptotic
theory. For instance, if an experimenter has fewer than, say,
50 animals with which to determine the LD50 (Lethal Dose
50%), then one may be concerned with designing a stochastic
approximation method with which to obtain the "best" possible
results and an estimate of the expected error.

1. Choice of $x_1$:

If one has prior information that $\theta$ (for, say,
$M(\theta) = 0.50$) lies in a narrow interval and picks $x_1$ in
that interval, then one can expect the estimates to arrive in
the neighborhood of $\theta$ within a few observations. If,
however, there is little prior knowledge of the magnitude of
$\theta$, then an initial bad choice of $x_1$ can induce a large bias
term which will dominate the observations for many steps.

2. Choice of Multiplier, $a_n$:

As previously discussed, $a_n = c/n$, where c equals the
inverse of the slope of $M(\cdot)$ at $\theta$, is optimal in a sense.
Thus one must accurately estimate c for optimal conditions.
If c is too small the step sizes may be too small to get to
$\theta$ before the samples are depleted. Similarly,
if c is too large the estimate may overshoot $\theta$ back and
forth. For a detailed analysis see Section V.A.

3. How to Allocate Samples:

If an experimenter has N samples to test, should
he test one at each step and take N steps, or test m at each
step and take n = N/m steps? Note that taking more than
one observation at each level, $x_n$, yields a more accurate
estimate of $M(x_n) = E(y|x_n)$. It was noted by Wetherill
[Ref. 130] and by Cochran and Davis [Ref. 17], and was proven
by Block [Ref. 5], that the variance of the estimate of $\theta$
depends only on the total number of samples, not on the
sampling scheme; however, the corresponding bias term, and
hence the mean squared error, is affected by the scheme.

Cochran and Davis presented two graphs which illustrate
their analysis, which is reproduced here. In their
notation $\sigma$ = the standard deviation of the observation,
Y(x), at $x = \theta$ (which in general will be unknown to us).
Also note the following terminology:

MSE : Mean Squared Error;

$C_0$ : Optimal choice of coefficient, c;

m   : number of samples taken at each level;
n   : number of levels or steps,

where nm = N = total number of samples.

[Figure 2]          [Figure 3]

Figure 2 implies that if the optimal coefficient is relatively
unknown, it is more desirable to overestimate c so that we are
not "trapped" by a large initial bias and small steps. Figure 3
implies that if the initial guess, $x_1$, is more than about
$2\sigma$ away, then sampling should be done one at a time, while
if the initial guess is very accurate, then the MSE's are
smaller, although very slightly so, for larger m. Thus, as a
general rule, unless we know that the initial guess is very
accurate or unless the cost of setting up experiments at
different levels is high, sampling should be conducted as one
sample per level.

Another question which the experimenter may ask is how
much more accurate an estimate becomes if he doubles the
number of samples, say from scheme 1: m = 3, n = 8 to
scheme 2: m = 6, n = 8. Doubling the value of m in this
way reduces the variance of the estimate by almost exactly
one-half, but produces only a slight decrease in the bias.
Consequently, if V, B, and M are the variance, bias and MSE
for m = 3, n = 8 (scheme 1), then the corresponding MSE for
m = 6, n = 8 (scheme 2) can be predicted by the expression

$$MSE_2 = B^2 + V/2 = (B^2 + M)/2,$$

where the second equality uses $M = B^2 + V$. (This assumes
$x_1$ is the same for both schemes.) This expression
overestimates the MSE, but at most by only a few percent.

For further results and comparisons of methods utilizing
small sample theory see Cochran and Davis [Ref. 17], Davis
[Ref. 22], Wetherill [Ref. 130], and Odell [Ref. 87].

D. ESTIMATION OP EXTREME QUANTILES 

For estimates of quantiles near the mid-region of a
quantal response curve the Robbins-Monro method appears to
perform quite well. In fact, for estimation of the $\theta_{.50}$
quantile both Wetherill [Ref. 130] and Davis [Ref. 22]
showed that sample sizes as small as 35 produced results
which were in good general agreement with asymptotic theory.
However, in areas away from the neighborhood of $\theta_{.50}$ the
small sample estimates frequently have large biases and
have variances greatly in excess of theoretical predictions.
This behavior was also noted by Stillings and Logan [Ref. 108].

To try to explain this phenomenon Wetherill [Ref. 131]
presented the following example.

Suppose an experimenter wishes to estimate $\theta_{.90}$ and
that his initial level, $x_1$, is very close to the true value.
Suppose further that the first observation is zero, a failure
(as it will be about once every ten trials); then, with step
sizes $a_n = c/n$, the second observation will be taken at the
level

$$x_2 = x_1 - c(0 - 0.90) = x_1 + .90c.$$

This value, $x_2$, may well be far above $\theta_{.90}$. Assume that
the next two observations are ones (successes). This
leads to

$$x_3 = x_2 - \frac{c}{2}(1 - .90) = x_1 + .85c$$

and

$$x_4 = x_3 - \frac{c}{3}(1 - .90) \approx x_1 + .817c.$$

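As can easily be observed, the level of testing is returning
very slowly to the vicinity of $\theta_{.90}$. In fact, even if every
later observation is a success, the corrections $.10c/n$ must
accumulate to more than $.90c$, so on the order of $e^{9}$
observations are necessary to pass below $x_1$.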
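
The slow return can be counted directly. The following sketch assumes step sizes $a_n = c/n$, a failure on the first trial and a success on every later trial; the function name and default arguments are illustrative only.

```python
def observations_to_return(p=0.90, c=1.0, x1=0.0):
    """Count the observations needed before the level first drops below x1.

    Assumes a_n = c/n, a failure (y = 0) on the first trial and a
    success (y = 1) on every later trial.
    """
    x = x1 - (c / 1) * (0 - p)    # x_2, the level after the initial failure
    n = 1                         # observations taken so far
    while x >= x1:
        n += 1
        x -= (c / n) * (1 - p)    # x_{n+1}, the level after a success at step n
    return n

print(observations_to_return())
```

For the .90 quantile the count comes out a little above twelve thousand, which lies between $e^{9}$ and $e^{10}$ and reflects the logarithmic growth of the harmonic sum $\sum 1/n$.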

Methods using accelerated stochastic approximation tend
to minimize this effect, but the most interesting treatment
of this area thus far has been done by Goodman, Lewis, and
Robbins [Ref. 58]. Here a "maximum transformation" is
employed by taking multiple samples at a level. If it is
desired to estimate $\theta$ such that $P(\theta) = .99$, where $P(x)$
is a cumulative distribution function, then V samples
$S_1, \ldots, S_V$ are taken at each level s. Here V is the
solution of the equation

$$(0.99)^V = 0.50.$$

Then let

$$P_{\max}(s) = \mathrm{Prob}\{\max_{1 \le i \le V} S_i < s\} = [P(s)]^V.$$

In this case the solution for V is approximately V = 69, and
69 samples would be taken at each iteration. Thus the problem
has been transformed into estimating the $\theta_{.50}$ level of
$P_{\max}$, where the Robbins-Monro process is known to work
well.
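
The number of samples per level follows from taking logarithms of $(0.99)^V = 0.50$. A minimal sketch, in which the function name and the generalization to other target quantiles are assumptions made for illustration:

```python
import math

def samples_per_level(target=0.99, transformed=0.50):
    """Solve target**V = transformed for V and round up to a whole sample."""
    v = math.log(transformed) / math.log(target)
    return v, math.ceil(v)

v_exact, v_used = samples_per_level()
print(v_exact, v_used, 0.99 ** v_used)
```

Rounding V up to a whole number of samples moves the transformed level only slightly away from .50.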

Yuguchi [Ref. 135] followed this same "maximum transformation"
technique and then applied variance reduction and jackknifing
techniques to improve the rate of convergence and to reduce
bias.

E. THE CASE WHERE M(x) STOPS BEING A CONSTANT 

Consider a response function where there is no reaction
for $x < \theta$ (i.e., $M(x) = 0$ for $x < \theta$ and $M(x) > 0$ for
$x > \theta$).

[Figure: the response function M(x).]

Oftentimes one is interested in the level, $\theta$, at which response
first occurs (see Guttman and Guttman [Ref. 60]). Friedman
[Ref. 51] proved the following theorem.

THEOREM [Ref. 51]

Let the following conditions be satisfied:

(i) $|M(x)| \le L|x| + K$;

(ii) $\sigma^2(x) \le \sigma^2$;

(iii) if $x < \theta$, then $M(x) = 0$; if $x > \theta$, then $M(x) > 0$;

(iv) for every $\delta > 0$, $\inf_{x - \theta > \delta} |M(x)| > 0$.

Let $\{a_n\}$, $\{d_n\}$ be such that

$$a_n > 0, \quad \sum_{n=1}^{\infty} a_n = \infty, \quad \sum_{n=1}^{\infty} a_n^2 < \infty, \quad d_n > 0, \quad \sum_{n=1}^{\infty} a_n d_n = \infty, \quad d_n \to 0,$$

and let $\{x_n\}$ satisfy the relation

$$x_{n+1} = x_n + a_n(d_n - y_n).$$

Then $x_n \to \theta$ with probability 1 and in mean square.

This theorem says that one can use stochastic approximation
to find the point at which the regression function
stops being a constant, provided the value of this constant is
known. If one does not know the value of the constant,
Friedman has proved another theorem, which imposes sharper
conditions on M(x), under which $x_n$ does converge to the
desired value $\theta$.

F. BOTH VARIABLES SUBJECT TO ERROR 

In the usual Robbins-Monro procedure it is assumed that
the regression function, $M(x_n)$, is observable subject to an
error term, say $v_n$. One might ask under what conditions
the process will converge if there also exists a random error
component, say $u_{x_n}$, in the level setting of $x_n$, as in
practice it is not always possible to measure or set the
desired amount precisely. Dupac and Kral [Ref. 35] discussed
two such cases. In the first case the error in setting the
level is assumed to be unaffected by the experimenter. In the
second case it is assumed that the error in the x level can be
made arbitrarily small for an inversely proportional price. In
the first case, that of "irreducible errors," Dupac and Kral
proved the following theorem.

THEOREM [Ref. 35]

Assume that the following conditions are satisfied:

(i) $M(\cdot)$ is odd with respect to $\theta$,
i.e. $M(\theta + x) = -M(\theta - x)$ for all x;

(ii) $M(x)$ is strictly increasing;

(iii) $|M(x_2) - M(x_1)| \le C_1 + C_2|x_2 - x_1|$ for all $x_1, x_2$;

(iv) $U_x$ is a symmetric random variable for each x,
i.e. $P(U_x < -c) = P(U_x > c)$ for all c, and
$\mathrm{Var}\, U_x \le C_3$ for all x;

(v) $\mathrm{Var}\, V_x \le C_4$ for all x.

Then the Robbins-Monro procedure defined by

$$x_{n+1} = x_n - a_n\{M(x_n + u_{x_n}) + v_{x_n}\}$$

converges to $\theta$ with probability 1 as well as in mean square.

In the second case of Dupac and Kral, where one can
decrease the x setting errors, $U_x$, for an inversely
proportional price, they proved what intuition would suggest.
They showed that it is needless to pay for high
precision at the starting steps; the precision should be
increased in the course of the approximation process.
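
The "irreducible errors" case can be illustrated by simulation. The sketch below is not taken from Dupac and Kral: the regression function, the root THETA, the uniform level-setting error, the normal observation error and the step sizes $a_n = 1/n$ are all assumptions chosen so that the hypotheses of the first theorem hold.

```python
import math
import random

THETA = 2.0  # hypothetical root of the regression function

def M(x):
    # Assumed test function: odd about THETA, strictly increasing, Lipschitz.
    return (x - THETA) + math.tanh(x - THETA)

def robbins_monro_with_level_error(x1, n_steps, seed=0):
    """Run x_{n+1} = x_n - a_n {M(x_n + u_n) + v_n} with a_n = 1/n."""
    rng = random.Random(seed)
    x = x1
    for n in range(1, n_steps + 1):
        u = rng.uniform(-0.5, 0.5)   # symmetric error in setting the level
        v = rng.gauss(0.0, 0.5)      # error in observing the response
        y = M(x + u) + v             # response actually observed
        x = x - (1.0 / n) * y
    return x

print(robbins_monro_with_level_error(5.0, 20000))
```

Because M is odd about THETA and the level error is symmetric, the averaged response still changes sign at THETA, which is why the iterates settle there even though each level is set imprecisely.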

G. THE CASE OF $\alpha$ UNKNOWN

Consider the following scenario. Suppose a scientist
is comparing two drugs, a test drug and a control drug.
He is interested in designing a biological assay to estimate
the number of dose units of the test drug necessary to
elicit the same mean response as the standard dose of the
control drug. Suppose further that the experimenter knows
little about the shape of the response function associated
with the test drug and about the probability distribution
of response at any one dose level of either drug.

Make the following notational identifications. Let an
observed response to the control drug administered at the
standard dose level correspond to the random variable Z,
with mean $\alpha$. Let the observed response to the test drug
at dose level x correspond to Y(x), with mean M(x). Let
$\theta$ be the unknown dose level of the test drug such that
$M(\theta) = \alpha$. Then under weak conditions on M(x) and the
distributions of Y(x) and Z, the process defined by



$$x_{n+1} = x_n - a_n\{Y(x_n) - z_n\}$$



satisfies all known properties of the original Robbins-Monro
procedure. It seems, as was noted by Hamilton [Ref. 62],
that this procedure does not use all available information
at each step. Since $n^{-1}\sum_{i=1}^{n} z_i$ is a better estimator of
$\alpha$ than $z_n$ alone, one would expect a smaller mean squared
error from using the sequential estimate of $\alpha$, especially in
cases of small sample sizes. To analyze this Hamilton
compared two processes.

Process 1 takes multiple observations at each step
and computes an estimated value of $\alpha$ based only on
the observations taken at that step (possibly
only one).

Process 2 takes the same number of multiple observations
at each step but computes the estimate of
$\alpha$ based on all of the observations from the
beginning of the process.

Hamilton then showed that under certain conditions it
is better, in magnitude of mean squared error, to use only
the most recent control observations (process 1) rather
than taking sequential steps toward the mean of the control
observations. This result, based on large sample theory,
remains true in a simplified (linear) small sample situation.
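
The difference between the two processes lies only in how the control mean is estimated at each step. The sketch below makes that explicit; the function and class names, and the use of a single averaged test-drug response `y_mean` per step, are assumptions for illustration and are not Hamilton's notation.

```python
def process1_step(x, a_n, y_mean, z_batch):
    """Process 1: estimate alpha from the current step's control data only."""
    alpha_hat = sum(z_batch) / len(z_batch)
    return x - a_n * (y_mean - alpha_hat)

class Process2:
    """Process 2: estimate alpha from all control data seen so far."""

    def __init__(self):
        self.total = 0.0   # running sum of control observations
        self.count = 0     # number of control observations

    def step(self, x, a_n, y_mean, z_batch):
        self.total += sum(z_batch)
        self.count += len(z_batch)
        alpha_hat = self.total / self.count
        return x - a_n * (y_mean - alpha_hat)

# Two steps with hypothetical data: control readings 2.0, then 4.0.
p2 = Process2()
x_a = p2.step(10.0, 1.0, 3.0, [2.0])          # alpha_hat = 2.0
x_b = p2.step(x_a, 0.5, 3.0, [4.0])           # alpha_hat = 3.0 (running mean)
x_c = process1_step(x_a, 0.5, 3.0, [4.0])     # alpha_hat = 4.0 (latest only)
print(x_a, x_b, x_c)
```

The two rules take the same first step and part ways from the second step on, where process 2 averages over the earlier control reading and process 1 does not.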
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VI. APPLICATIONS 



In this Section several applications of stochastic
approximation to a variety of fields will be presented.
The first example is an application to a problem in
biological research by Guttman and Guttman [Ref. 60]. It is
presented first since it is a simple, straightforward
problem of the type for which the Robbins-Monro method was
conceived (also see Hawkins [Ref. 63]). This straightforward
use of the R-M method is also applicable to industrial
process control as discussed by Comer [Refs. 18 and 19],
where a lag in process response is incorporated into
the formulation.

However, more practical use of stochastic approximation
is based on the concepts of maximization or minimization of
functions. Many problems which can be analytically solved
if the response format is known fall nicely into the
stochastic approximation framework, since answers do not depend
on the assumed parameterization. Also, many problems based
on a criterion, such as minimizing expected squared error,
can be computationally very difficult to solve, as the
solutions may require matrix inversions, as in the
multidimensional case. Many problems of this type (see Saridis,
Nikolic and Fu [Ref. 99]) fall into the stochastic
approximation framework and yield computationally simple algorithms
which require very little storage space when performed on
a digital computer.

In a recent book edited by Mendel and Fu [Ref. 83] a
chapter has been devoted to applications of stochastic
approximation methods. Also, Tsypkin [Ref. 112] has nicely
reviewed the applicability of the Robbins-Monro
process and related stochastic approximation methods to
problems concerning pattern recognition, adaptive filters,
adaptive automatic control systems, and adaptation in
operations research and reliability theory. Some of the additional
papers which have considered these latter types of application
of stochastic approximation are by Aizerman et al. [Ref. 1],
Ernst [Ref. 38], Kailath and Schalkwijk [Ref. 67], Lee
[Ref. 80], Sakrison [Refs. 93 and 9?], Sklansky [Ref. 106],
Tsypkin [Ref. 111] and Ulrich [Ref. 116].

A. APPLICATION TO A PROBLEM IN BIOLOGICAL RESEARCH 

Guttman and Guttman [Ref. 60] desired to treat Paramecium
caudatum cells with a substance, kinetin, which would
stimulate cell division, and to estimate the time at which
a certain level of this cell division was attained. They
postulated that the ratio of the number of daily cell
divisions of treated paramecia to untreated paramecia (K/C)
was a monotone increasing function of time of exposure to
kinetin. Guttman and Guttman stated that they had no idea
of the underlying probability distribution concerning the
ratio, K/C, thereby making stochastic approximation a very
convenient scheme. A Robbins-Monro scheme was formulated
to estimate the time at which K/C = 1.10. The initial
guess of $X_1 = 30$ hours was chosen with the expectation
that the desired value of X was somewhere below this.
The sequence $a_n$ was chosen as 20/n to allow for
large corrections in the first few steps and smaller
corrections thereafter. The stochastic approximation
sequence, as formulated for this problem, then looked like

$$X_{n+1} = X_n + \frac{20}{n}(1.10 - Y_n),$$

where $Y_n$ = the observed response ratio at time $X_n$. Guttman
and Guttman's table of observations, $Y_n$, and computed next
levels, $X_n$, is reproduced in Table 2.

The experiment was terminated at n = 13 as no appreciable
differences appeared among the $X_n$ from trial 6 onward.
Note that the mean value of the observations from n = 6
onwards is in fact equal to 1.10.
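
The recursion is simple enough to carry out by hand, but a short sketch makes the bookkeeping explicit. The function name is hypothetical; the starting level, the weights 20/n, the target ratio 1.10 and the two observations used below are the ones quoted for the first two trials of this experiment.

```python
def robbins_monro_levels(x1, observations, target=1.10, scale=20.0):
    """Return X_1, X_2, ... from X_{n+1} = X_n + (scale/n)(target - Y_n)."""
    levels = [x1]
    for n, y_n in enumerate(observations, start=1):
        levels.append(levels[-1] + (scale / n) * (target - y_n))
    return levels

# First two observed ratios of the experiment: Y_1 = 1.067, Y_2 = 1.30.
levels = robbins_monro_levels(30.0, [1.067, 1.30])
print([round(x, 2) for x in levels])
```

The computed levels, 30.66 and 28.66 hours, round to the 30.7 and 28.7 hours listed for the second and third trials.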

B. AN APPLICATION TO TAILORED TESTING 

Suppose an educator or psychologist desires to measure
some mental trait of an individual. For instance, suppose
it is desired to measure the level of difficulty of questions,
X, such that the individual will get, say, $\alpha$ = 70% of them
correct. Suppose further that the educator has a bag full
of questions, each assigned a level of difficulty, $B_i$, such
that the probability that an individual whose true ability
is at level i will correctly answer a question of difficulty
$B_i$ is equal to $\alpha$ = .70. This is similarly written

$$P_i(B_i) = \alpha.$$

This idea was presented by Lord [Ref. 82], who proposed
a computer controlled testing scheme where questions of
difficulty $B_i$ would be recursively selected by the scheme

$$B_{i+1} = B_i + a_i\{Y(B_i) - \alpha\}.$$

Thus the scheme would eventually converge to the
individual's true ability, provided that the assumptions
were correct.

TABLE 2

Stochastic Approximation of Hours of Treatment Required with
1.5 mg/l Kinetin to Produce an Expected Ratio of Divisions
Kinetin/Control Equal to 1.1.

 n    HOURS OF TREATMENT (X_n)    OBSERVED K/C (Y_n)    WEIGHT (a_n)

 1            30                       1.067               20
 2            30.7                     1.30                10
 3            28.7                     1.131                6.67
 4            27.3                     1.223                5
 5            26.6                     1.577                4
 6            24.8                     1.133                3.33
 7            24.6                     0.89                 2.86
 8            25.2                     1.00                 2.5
 9            25.5                     0.81                 2.2
10            25.6                     1.31                 2
11            25.1                     1.21                 1.82
12            24.8                     1.03                 1.66
13            24.9

C. UPGRADING OF INERTIAL NAVIGATION SYSTEMS

Consider a navigational platform with several high grade
gyros required for motion sensing. Bernard Lee [Ref. 80]
suggested replacing all but one gyro with a lower grade, less
expensive gyro. A supervisory system based on a continuous
Kiefer-Wolfowitz stochastic approximation algorithm similar
to that developed by Sakrison [Ref. 90] is then used to
estimate the drift rate of each of the low grade gyros and
to apply a corrective signal. This concept permits each
substandard gyro to acquire a precision approaching that
of the high grade gyro.

D. APPROXIMATION OF DISTRIBUTION AND DENSITY FUNCTIONS 
Consider the distribution $F(a) = \mathrm{Prob}[X \le a]$, where
X is a scalar random variable. The problem is to find an
approximation to $F(\cdot)$ by a linear combination of a previously
chosen vector of functions, $\phi^T(x) = (f_1(x), f_2(x), \ldots, f_n(x))$,
where the superscript T denotes the transpose of the column
vector $\phi(x)$. Thus we desire to find a column vector of
coefficients, $C_F$, such that our approximation

$$\hat{F}(x) = C_F^T \phi(x)$$

minimizes some criterion such as the expected squared
error in a region of interest (a, b). Denote the mean square
error as

$$J_F(C_F) = \int_a^b \{F(x) - C_F^T \phi(x)\}^2\, dx.$$

Now minimizing $J_F(C_F)$ is equivalent to solving the matrix
equation

$$-\frac{1}{2}\frac{dJ_F}{dC_F} = \int_a^b F(x)\phi(x)\,dx - \int_a^b \phi(x)\phi^T(x)\,dx\; C_F = 0$$

or

$$\int_a^b F(x)\phi(x)\,dx - K C_F = 0,$$

where

$$K = \int_a^b \phi(x)\phi^T(x)\,dx$$

is an n x n matrix.

Now define a random function $Z_F(y,x)$ such that

$$Z_F(y,x) = \begin{cases} 1 & \text{if } y \le x, \\ 0 & \text{if } y > x, \end{cases}$$

and such that, for a sample Y from F,

$$E[Z_F(Y,x)] = 1 \cdot F(x) + 0 \cdot (1 - F(x)) = F(x).$$

Thus the regression matrix equation

$$E\left\{\int_a^b Z_F(Y,x)\phi(x)\,dx\right\} - K C_F = 0$$

is equivalent to our previous equations for finding the
minimum of the criterion, $J_F(C_F)$. But this can now be solved
by a stochastic approximation algorithm if successive
independent samples of the random variable, Y, are available.
The algorithm can be written as

$$C_F(j+1) = C_F(j) + a_j[B_F(Y(j)) - K C_F(j)],$$

where we define

$$B_F(y(j)) = \int_a^b Z_F(y(j),x)\phi(x)\,dx,$$

i.e.

$$B_F(y(j)) = \begin{cases} \int_a^b \phi(x)\,dx & \text{if } y(j) \le a, \\ \int_{y(j)}^b \phi(x)\,dx & \text{if } a < y(j) < b, \\ 0 & \text{if } b \le y(j), \end{cases}$$

where y(j) is the j-th sample from the distribution and
the sequence $a_j$ satisfies

$$\sum_{j=1}^{\infty} a_j = \infty \quad \text{and} \quad \sum_{j=1}^{\infty} a_j^2 < \infty.$$

Thus the above algorithm now fits the format of
multidimensional stochastic approximation. In particular, if
the matrix K is positive definite, it satisfies the conditions
of a theorem by Blum [Ref. 7, theorem 2].

Then the sequence $C_F(j)$ converges with probability 1 to
the value which minimizes $J_F(C_F)$. This value can be written
as

$$C_F^* = K^{-1}\int_a^b F(x)\phi(x)\,dx,$$

but requires inversion of an n x n matrix to solve directly.
Therefore the above algorithm enables one to find a minimum
mean square error approximation to a distribution function
for which the only available information is the collection
of sample values randomly selected. This algorithm is from
a paper by Blaydon [Ref. ] and can be similarly extended to
approximate density functions.

A refinement of Blaydon's algorithm was presented
by Deuser and Lainiotis [Ref. 27]. The refinement incorporates
a double stochastic approximation algorithm to recursively
generate a matrix from each independent observation and
then to recursively generate the estimate of the coefficient
vector using the previously generated matrix as an
observation. Deuser and Lainiotis presented the example where the
unknown probability distribution is $F(x) = 1 - e^{-x}$ for $x \ge 0$.

The approximating function, $\hat{F}(x)$, is to be a weighted
sum of the first three Laguerre polynomials,

$$\phi(x) = (1,\; 1 - x,\; 1 - 2x + x^2/2)^T,$$

and the initial choice of the coefficient vector is the
zero vector. It can be shown analytically that the optimal
coefficients are

$$C^{*T} = (.480,\; -.186,\; -.239).$$

In a computer simulation using 1000 samples and
the step sequence $a_n = 1/n$, Deuser and Lainiotis obtained
estimates for the coefficients which, on the average, did
not differ by more than .01 in absolute value from the
optimal coefficients.
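
The recursion for the coefficient vector can be sketched in a few lines. The example below is a deliberately simplified illustration and does not reproduce the Laguerre example: it assumes samples from a uniform distribution on (0, 1), a two-function basis that is orthonormal on (0, 1) so that K is the identity matrix, and steps 1/j. All names are hypothetical.

```python
import math
import random

A, B = 0.0, 1.0                   # assumed region of interest (a, b)
SQRT12 = math.sqrt(12.0)

def phi(x):
    # Assumed basis, orthonormal on (0, 1): phi_1 = 1, phi_2 = sqrt(12)(x - 1/2).
    return [1.0, SQRT12 * (x - 0.5)]

K = [[1.0, 0.0], [0.0, 1.0]]      # integral of phi phi^T over (0, 1)

def b_vector(y):
    """Integral of phi(x) over max(y, a) <= x <= b; zero when y >= b."""
    lo = min(max(y, A), B)
    first = B - lo
    second = SQRT12 * ((B * B - lo * lo) / 2.0 - (B - lo) / 2.0)
    return [first, second]

def approximate_cdf(samples):
    """Iterate C(j+1) = C(j) + a_j [B(y(j)) - K C(j)] with a_j = 1/j."""
    c = [0.0, 0.0]                # initial coefficient vector
    for j, y in enumerate(samples, start=1):
        a_j = 1.0 / j
        b = b_vector(y)
        kc = [sum(K[r][s] * c[s] for s in range(2)) for r in range(2)]
        c = [c[r] + a_j * (b[r] - kc[r]) for r in range(2)]
    return c

rng = random.Random(1)
coeffs = approximate_cdf([rng.random() for _ in range(5000)])
f_hat = sum(ci * fi for ci, fi in zip(coeffs, phi(0.25)))
print(coeffs, f_hat)
```

For this assumed setup the exact minimizer is $(1/2,\ 1/\sqrt{12})$, which reproduces F(x) = x on (0, 1), so the printed approximation at x = 0.25 should be close to 0.25. No matrix inversion is needed at any step.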



E. AN APPROACH TO PATTERN RECOGNITION USING STOCHASTIC 
APPROXIMATION TO MINIMIZE RISK 

Consider a mixture from two samples where an observation
drawn at random is of type 1 with unknown probability
p, and of type 2 with probability 1 - p.
It is desired to measure some quality of the samples, call
it X, and apply a decision rule, say

$$d(x,a) = \begin{cases} 1 & \text{if } x < a, \\ 2 & \text{if } x > a, \end{cases}$$

where $a = \theta$ is some unknown value which minimizes a risk
function, R(d(x,a)), which we have chosen. Since the choice
of a completely specifies the decision rule and risk function,
denote them d(a) and R(a).

Now R(a) can be viewed as a regression function. By
this it is meant that there exists a random variable, Y,
with conditional probability distribution function F(Y|a)
such that

$$R(a) = E(Y|a).$$

Such a random variable, Y, is defined as follows:

Let Y (given a) = $C_{ij}$ if Z is an observation which
actually is of type i and is classified
by d(z,a) = type j.

In general $C_{ii} = 0$.

Then a simple one-dimensional stochastic approximation
scheme can demonstrate the solution process. Consider a
test sample where it is not known of which population each
item is a member. Then define the scheme

$$a_{n+1} = a_n - \frac{b_n}{2d_n}(Y_n^+ - Y_n^-),$$

where $Y_n^+ = C_{ij}$ if sample $Z_{2n-1}$ is actually of type i and
$d(Z_{2n-1}, a_n + d_n)$ = type j, and $Y_n^- = C_{ij}$ if sample
$Z_{2n}$ is actually of type i and $d(Z_{2n}, a_n - d_n)$ = type j,
where $a_1$ is chosen arbitrarily and the conditions

$$\sum_{n=1}^{\infty} b_n = \infty, \quad \lim_{n \to \infty} d_n = 0, \quad \sum_{n=1}^{\infty} (b_n/d_n)^2 < +\infty$$

are satisfied. Note that the risk function must satisfy

$$\sup_{1/k < a - \theta < k} \overline{D}R(a) > 0$$

and

$$\inf_{1/k < a - \theta < k} \underline{D}R(a) < 0 \quad \text{for all integers } k,$$

where $\overline{D}R(a)$ = the limit superior of $[R(a+h) - R(a)]/h$
for $h \to 0$

and $\underline{D}R(a)$ = the limit inferior of $[R(a+h) - R(a)]/h$
for $h \to 0$.

Note that R(a) does not have to be differentiable at all a.
Then if the above conditions are satisfied, $a_n$ converges
in probability to $\theta$ and $\lim_{n \to \infty} E[(a_n - \theta)^2] = 0$.

Then the decision rule which minimizes the risk function
is

$$d(X,\theta) = \begin{cases} 1 & \text{if } X < \theta, \\ 2 & \text{if } X > \theta. \end{cases}$$
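
One step of such a threshold-adjustment scheme can be sketched as follows. This is an illustration under stated assumptions rather than Cooper's procedure: it assumes that the true type of each test item can be checked when its cost is scored, uses a hypothetical 0-1 cost table, assigns type 2 when x equals the threshold, and takes $b_n = b/n$ and $d_n = d/n^{1/4}$ so that the three conditions on the sequences hold.

```python
def classify(x, a):
    """Decision rule d(x, a): type 1 below the threshold, type 2 otherwise."""
    return 1 if x < a else 2

def loss(true_type, x, a, cost):
    """Y given a: cost[i][j] for a type-i item classified as type j."""
    return cost[true_type][classify(x, a)]

def threshold_step(a_n, n, item_plus, item_minus, cost, b=1.0, d=1.0):
    """One update a_{n+1} = a_n - (b_n / (2 d_n)) (Y+ - Y-).

    Each item is a pair (true_type, x); item_plus is scored at a_n + d_n
    and item_minus at a_n - d_n.
    """
    b_n = b / n
    d_n = d / n ** 0.25
    y_plus = loss(item_plus[0], item_plus[1], a_n + d_n, cost)
    y_minus = loss(item_minus[0], item_minus[1], a_n - d_n, cost)
    return a_n - b_n / (2.0 * d_n) * (y_plus - y_minus)

# Hypothetical costs: 0 for a correct classification, 1 for an error.
COST = {1: {1: 0.0, 2: 1.0}, 2: {1: 1.0, 2: 0.0}}
a_2 = threshold_step(0.0, 1, (1, 0.5), (1, 0.5), COST)
print(a_2)
```

In this single step a type 1 item at x = 0.5 is classified correctly with the raised threshold and incorrectly with the lowered one, so the update moves the threshold upward.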

The above one-dimensional scheme was presented by Cooper
[Ref. 20], who stated that the application to a K-dimensional
scheme including noise could be performed by modifying the
above procedure to the multidimensional case of Blum [Ref. 7].
It is noted that the above example falls into the framework
of Bayesian learning and decision rules. An excellent
paper by Chien and Fu [Ref. 13] discusses Bayesian related
learning procedures which can be shown to be a special case
of stochastic approximation algorithms and hence can be
carried out in computationally simple schemes such as the one
just presented.
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VII. AREAS FOR FURTHER STUDY 



This Section is devoted to stating particular areas
where further work may be of interest. These ideas have
been noted as either not being discussed in the current
literature or as having been analyzed when required
conditions were not satisfied.

A. DEVIATIONS FROM THE LINEAR CASE 

Section III.G discussed the estimate of expected squared 
error for the linear case and mentioned that other than a 
sampling experiment by Teichroew, very little analysis had 
been done. Work needs to be done in this area to determine 
limits of departures from linearity where linear results 
remain valid. 

B. STOPPING RULES 

Stopping rules not based on bounded confidence intervals
utilizing asymptotic normality are almost nonexistent. What
is needed is some nonparametric stopping rule based on, say,
the number of changes of sign of $(x_n - x_{n-1})$. Many authors
have noted that this is a virtually untouched area, yet almost
nothing has appeared in the literature.

C. POSSIBLE WEAKENING OF CONDITIONS ON $a_n$

In Comer's paper "Application of Stochastic Approximation
to Process Control" [Ref. 19], an error in the formulation
of a Robbins-Monro process yields interesting results.






Comer mistakenly used the step sequence $a_n = 1/n^{.5}$ in
a simulation comparison. Note that this sequence does not
satisfy the requirement $\sum_{n=1}^{\infty} a_n^2 < \infty$. However, his
results, when compared with the same simulation using $a_n = 1/n$,
which does satisfy the necessary requirements, show that
the sequence $a_n = 1/n^{.5}$ gives comparable if not superior
results. The question to explore is whether (1) this is an
artifact of Comer's simulation, or (2) the conditions on $a_n$
can actually be weakened in practice to obtain more desirable
results.

D. REPEAT SIMULATION OF THE KIEFER-WOLFOWITZ PROCESS

In a previous simulation comparison study of Kiefer-Wolfowitz
type methods, Springer [Ref. 107] used as a
sequence of norming constants the sequence $\{a_n\}$ where $a_n$ = ^n/2.
He discussed the result of finding a small sample bias which,
one should note, can be attributed to the fact that this
sequence does not satisfy the assumption that $\sum_{n=1}^{\infty} a_n = \infty$.
Perhaps a new simulation study using proper coefficients is
in order.

E. MULTIDIMENSIONAL EXTENSION OF DUPAC AND KRAL'S RESULTS

Dupac and Kral [Ref. 35] (see Sec. V.F) examined the
one-dimensional Robbins-Monro case where there are errors in
setting the x-level. They cited conditions under which
$x_n \xrightarrow{P} \theta$ when these errors exist. They noted that
errors of this type make the Kiefer-Wolfowitz procedure
practically inapplicable to this type of analysis, but speculated
that a generalization to the multidimensional case might be of
interest.
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